Explainability-aided Domain Generalization for Image Classification

Author: Robin M. Schmidt

Supervisor: Dr. Massimiliano Mancini

Co-Supervisor: Prof. Dr. Zeynep Akata

Reviewer: Prof. Dr. Philipp Hennig

Eberhard Karls Universität Tübingen

A thesis submitted in fulfillment of the requirements
for the degree of Master of Science (M.Sc.) in Computer Science

in the
Department of Computer Science
Explainable Machine Learning Group

Abstract of thesis entitled

Explainability-aided Domain Generalization for Image Classification

Submitted by Robin M. Schmidt for the degree of Master of Science (M.Sc.) at the University of Tübingen in April 2021

Traditionally, in most machine learning settings, gaining some degree of explainability, which aims to give users more insight into how and why the network arrives at its predictions, restricts the underlying model and hinders performance to a certain degree. For example, decision trees are considered more explainable than deep neural networks, but they lack performance on visual tasks. In this work, we empirically demonstrate that applying methods and architectures from the explainability literature can, in fact, achieve state-of-the-art performance on the challenging task of domain generalization while offering a framework for more insight into the prediction and training process. To this end, we develop a set of novel algorithms: DivCAM, an approach in which the network receives guidance during training via gradient-based class activation maps to focus on a diverse set of discriminative features, as well as ProDrop and D-Transformers, which apply prototypical networks to the domain generalization task, either with self-challenging or with attention alignment. Since these methods offer competitive performance on top of explainability, we argue that they can be used as tools to improve the robustness of deep neural network architectures.

Copyright © 2021 by Robin M. Schmidt. ALL RIGHTS RESERVED.

Declaration

I, Robin M. Schmidt, declare that this thesis, titled "Explainability-aided Domain Generalization for Image Classification" and submitted in fulfillment of the requirements for the degree of Master of Science in Computer Science, represents my own work except where acknowledgements have been made. I further declare that this work has not previously been included, as a whole or in part, in a thesis, dissertation, or report submitted to this university or to any other institution for a degree, diploma, or other qualification.

Contents

1 Introduction 1

2 Domain Generalization 3
2.1 Problem formulation 3
2.2 Related concepts and their differences 5
2.2.1 Generic Neural Network Regularization 6
2.2.2 Domain Adaptation 6
2.3 Previous Works 6
2.3.1 Learning invariant features 7
2.3.2 Model ensembling 8
2.3.3 Meta-learning 8
2.3.4 Data Augmentation 8
2.4 Common Datasets 9
2.4.1 Rotated MNIST 9
2.4.2 Colored MNIST 9
2.4.3 Office-Home 9
2.4.4 VLCS 11
2.4.5 PACS 11
2.4.6 Terra Incognita 11
2.4.7 DomainNet 11
2.4.8 ImageNet-C 11
2.5 Considerations regarding model validation 12
2.6 Deep-Dive into Representation Self-Challenging 12

3 Explainability in Deep Learning 15
3.1 Related topics 16
3.1.1 Model Debugging 16
3.1.2 Fairness and Bias 16
3.2 Previous Works 17
3.2.1 Visualization 17
Back-Propagation 17
Perturbation 18
3.2.2 Model distillation 19
Local approximations 19
Model Translation 19
3.2.3 Intrinsic methods 19
Attention mechanism 19
Text explanations 20
Explanation association 20
Prototypes 20
3.3 Explainability for Domain Generalization 22

4 Proposed Methods 23
4.1 Diversified Class Activation Maps (DivCAM) 23
4.1.1 Global Average Pooling bias for small activation areas 25
4.1.2 Smoothing negative Class Activation Maps 25
4.1.3 Conditional Domain Adversarial Neural Networks 27
4.1.4 Maximum Mean Discrepancy 27
4.2 Prototype Networks for Domain Generalization 28
4.2.1 Ensemble Prototype Network 29
4.2.2 Diversified Prototypes (ProDrop) 30
4.2.3 Using Support Sets (D-Transformers) 33

5 Experiments 35
5.1 Datasets and splits 35
5.2 Hyperparameter Distributions & Schedules 35
5.3 Results 35
5.4 Ablation Studies 37
5.4.1 Hyperparameter Distributions & Schedules 37
5.4.2 DivCAM: Mask Batching 38
5.4.3 DivCAM: Class Activation Maps 39
5.4.4 ProDrop: Self-Challenging 39
5.4.5 ProDrop: Intra-Loss 40

6 Conclusion and Outlook 43

Bibliography 45

A Domain-specific results 57

B Additional distance plots 63

List of Figures

2.1 Meta-distribution D generating source and unseen domain distributions 5
3.1 Class activation maps across different architectures 18
3.2 Prototypes for the MNIST and Car dataset 21
3.3 Image of a clay-colored sparrow and its decomposition into prototypes 21
4.1 Visualization of the DivCAM training process 24
4.2 Used class activation maps in DivCAM-S throughout training 26
4.3 Domain-agnostic Prototype Network 28
4.4 Ensemble Prototype Network 29
4.5 Second data split pairwise prototype distances with wc,j = 1.0 31
4.6 Second data split pairwise self-challenging prototype distances with wc,j = 1.0 33
B.1 First data split pairwise prototype distances with wc,j = 1.0 63
B.2 Third data split pairwise prototype distances with wc,j = 1.0 64
B.3 First data split pairwise self-challenging prototype distances with wc,j = 1.0 64
B.4 Third data split pairwise self-challenging prototype distances with wc,j = 1.0
B.5 First data split pairwise prototype distances with wc,j = 0.0 65
B.6 Second data split pairwise prototype distances with wc,j = 0.0
B.7 Third data split pairwise prototype distances with wc,j = 0.0 66
B.8 First data split pairwise self-challenging prototype distances with wc,j = 0.0
B.9 Second data split pairwise self-challenging prototype distances with wc,j = 0.0 67
B.10 Third data split pairwise self-challenging prototype distances with wc,j = 0.0 68

List of Tables

2.1 Differences in learning setups 5
2.2 Samples for two different classes across domains for popular datasets 10
2.3 Reproduced results for Representation Self-Challenging using the official code base 14
5.1 Performance comparison of the proposed methods on the PACS dataset 36
5.2 Performance comparison across datasets 37
5.3 Performance comparison for official PACS splits outside of DOMAINBED 38
5.4 Hyperparameters and distributions used for the mask batching ablation study 38
5.5 Hyperparameters and distributions used for the mask ablation study 39
5.6 Ablation study for the DivCAM mask batching on the PACS dataset 40
5.7 Ablation study for the DivCAM masks on the PACS dataset 41
5.8 Self-challenging performance comparison for different negative class weights 41
5.9 Self-challenging performance comparison for different intra factors 42
A.1 Domain-specific performance for the VLCS dataset 58
A.2 Domain-specific performance for the PACS dataset 59
A.3 Domain-specific performance for the Office-Home dataset 60
A.4 Domain-specific performance for the Terra Incognita dataset 61
A.5 Domain-specific performance for the DomainNet dataset 62

List of Algorithms

1 Spatial- and Channel-Wise RSC 14
2 Diversified Class Activation Maps (DivCAM) 25
3 Prototype Dropping (ProDrop) 32

List of Abbreviations

ADAM ADAptive Moment Estimation
AUC Area Under the Curve
CAM Class Activation Maps
CDANN Conditional Domain Adversarial Neural Network
CE Cross-Entropy
CNN Convolutional Neural Network
CMNIST Colored MNIST
DA Domain Adaptation
DANN Domain Adversarial Neural Network
DeepLIFT Deep Learning Important Features
DG Domain Generalization
DivCAM Diversified Class Activation Maps
DNN Deep Neural Network
ERM Empirical Risk Minimization
FC Fully Connected Layer
GAN Generative Adversarial Network
GAP Global Average Pooling
Grad-CAM Gradient-weighted Class Activation Maps
HNC Homogeneous Negative Class Activation Maps
I.I.D. Independent and Identically Distributed
KL Kullback-Leibler
LIME Local Interpretable Model-agnostic Explanations
MAML Model-Agnostic Meta-Learning
MMD Maximum Mean Discrepancy
MTAE Multi-Task Autoencoder
NLP Natural Language Processing
ProDrop Prototype Dropping
ReLU Rectified Linear Unit
RSC Representation Self-Challenging
RKHS Reproducing Kernel Hilbert Space
RMNIST Rotated MNIST
SGD Stochastic Gradient Descent
SVM Support Vector Machine
TAP Threshold Average Pooling
UML Unbiased Metric Learning

List of Symbols

Chapter 2

xi instance of input features
yi corresponding label for input features
y˙i instance of label one-hot encoding
(xi, yi) sample of input and label pair
y^ predicted output label
X random variable for input features
Y random variable for output labels
C number of classes
D training dataset
𝒟 training distribution
Z feature representation
z~ masked features
𝒳 input space
𝒴 output space
𝒵 latent space
Θ parameter space
θ model parameters
fθ model predictor
ϕ feature extractor
w classifier
K number of feature maps of the last convolutional layer
R(fθ) model risk
Rerm(fθ) empirical model risk
L loss term
Ξ set of source environments
Φ set of test environments
ξ environment
𝔇 meta-distribution
U unlabeled dataset
L labeled dataset
H reproducing kernel Hilbert space
φ feature map induced by a kernel
Δξ local environment bias
θξ environment parameters
Mc class activation map for class c
gz gradient with respect to the features
g~z average-pooled gradient values
Hz feature map height
Wz feature map width
yc logit for class c
mi,j feature mask at spatial location (i, j)
C change vector after applying the mask
qp feature percentile threshold
bp batch percentile threshold

Chapter 3

gθ interpretable model
G set of interpretable models
Π complexity measure
P set of prototypes
pj j-th prototype
Hp height of the prototypes
Wp width of the prototypes
gpj prototype unit
gp prototype layer
Ψj similarity map between the j-th prototype and the latent representation z
ϵ numerical stability factor
wc,j classifier weight connecting the j-th prototype unit and the class-c logit
θϕ parameters of the feature extractor
θw parameters of the classifier
‖·‖2 Euclidean distance
λ loss term weighting factor

Chapter 4

τtap threshold for average pooling
J>m set of top-m negative classes
U uniform probability matrix
Mc probability map
ω domain predictor
d domain ground truth
η ℓ2 regularization weighting factor
k kernel function
P set of source domain pairs
ϱ cosine distance
Sc support set for class c
Γ key head
Λ value head
Ω query head
k keys of the support set
q queries
V support-set values
W query image values
α dot similarity between keys and queries
α~ attention weights
B batch size
α learning rate
γ weight decay factor

Introduction

Modern machine learning solutions commonly rely on supervised deep neural networks, and one of the tasks most frequently required and implemented in practice is image classification. In its simplest form, such a network stacks multiple layers of linear transformations coupled with non-linear activation functions to classify an input image, based on its pixel values, into a discrete set of classes. The chosen architecture, also known as the model, then learns parameters that encode how to combine the information extracted from the individual pixel values, guided by additional labeled information. That is, for a set of images whose correct classes are known, we can automatically steer the network towards well-performing parameters by measuring how wrong the current prediction is and in which direction each parameter needs to be updated for the network to give a more accurate class prediction.

Obtaining this labeled information, however, is very tedious in practice and requires either a lot of manual human labeling or sufficient human quality assurance for any automatic labeling system. A commonly used approach to overcome this impediment is to combine multiple sources of data that may have been collected in different settings but represent the same set of classes and have already been labeled a priori. Since the training distribution described by the obtained labeled data then often differs from the testing distribution imposed by the images we observe once the system is deployed, we commonly face a distribution shift at test time, and the network generally needs to make out-of-distribution predictions. Similar behavior can be observed when the model encounters irregular conditions during testing, such as weather or lighting, which have not been captured well by the training data. For many computer vision neural network models, this poses an interesting and challenging task, known as out-of-distribution generalization, in which researchers try to improve predictions under these varying circumstances to obtain more robust machine learning models.

In this work, we pick up on this challenge and try to improve out-of-distribution generalization capabilities with models and techniques that have been proposed to make deep neural networks more explainable. For most machine learning settings, gaining a degree of explainability that gives humans more insight into how and why the network arrives at its predictions restricts the underlying model and hinders performance to a certain degree. For example, decision trees are considered more explainable than deep neural networks but lack performance on visual tasks. We investigate whether these properties also hold for the out-of-distribution generalization task, or whether we can deploy explainability methods during the training procedure and gain both better performance and a framework that offers users more explainability. In particular, we develop a regularization technique based on class activation maps, which visualize the parts of an image that led to certain predictions (DivCAM), as well as prototypical representations that serve as a number of class or attribute centroids which the network uses to make its predictions (ProDrop and D-Transformers). Specifically, we deploy these methods for the domain generalization task, in which the model has access to images from multiple training domains, each imposing a different distribution, but no access to images from the immediate testing distribution.

From our experiments, we observe that DivCAM in particular offers state-of-the-art performance on some datasets while providing a framework that enables additional insight into the training and prediction procedure. This property is highly desirable, especially in safety-critical scenarios such as self-driving cars, medical applications such as cancer or tumor prediction, or any other automation robot that needs to operate in a diverse set of environments. Hopefully, some of the methods presented in this work can find application in such scenarios and establish additional trust and confidence that the machine learning system works reliably. All of our experiments have been conducted within the DOMAINBED domain generalization benchmarking framework, and the respective code has been open-sourced.1

Chapter 2 Domain Generalization

Machine learning systems often lack out-of-distribution generalization, which causes models to rely heavily on the training distribution and, as a result, to perform poorly when presented with a different input distribution during testing. Examples are application scenarios where intelligent systems do not generalize well across health centers because the training data was collected in a single hospital [4, 31, 142], or where self-driving cars struggle under alternative lighting or weather conditions [36, 191]. Properties that are often falsely interpreted as part of the relevant feature set include backgrounds [17], textures [62], or racial biases [178]. Not only can a failure to capture this domain shift lead to poor performance, but in safety-critical scenarios it can have a large impact on people's lives. Due to the prevalence of this challenge for the widespread deployment of machine learning systems in diverse environments, many researchers have tackled this task with different approaches. In this chapter, we give a broad overview of the domain generalization literature and prepare the fundamentals for the following chapters. If you are already familiar with the field, you can safely skip this chapter and only familiarize yourself with the notation used.

2.1 Problem formulation

Supervised Learning In supervised learning, we aim to optimize the predictions y^ for the values y ∈ 𝒴 of a random variable Y when presented with values x ∈ 𝒳 of a random variable X. These predictions are generated by a model predictor fθ(⋅): 𝒳 → 𝒴, parameterized by θ ∈ Θ (usually the weights of a neural network), which assigns the predictions as y^ = fθ(x). To improve our predictions, we utilize a training dataset containing n input-output pairs, denoted as D = {(xi, yi)}i=1n, where each sample (xi, yi) is ideally drawn independently and identically distributed (i.i.d.) from a single joint probability distribution 𝒟. Using a loss term L(y^, y): 𝒴 × 𝒴 → ℝ+, which quantifies how different the prediction y^ is from the ground truth y, we would like to minimize the risk,

(2.1) R(fθ) = E(xi,yi)∼𝒟[L(fθ(xi), yi)],

of our model. Since we only have access to the distribution 𝒟 through a proxy in the form of the dataset D, we instead use Empirical Risk Minimization (ERM):

(2.2) Rerm(fθ) = (1/n) ∑i=1n L(fθ(xi), yi),

by summing the loss terms of each sample. One common choice for this loss term is the Cross-Entropy (CE) loss, shown in Equation (2.3).

(2.3) Lce(y^i, y˙i) = −∑c=1C yi,c log(y^i,c)

Here, y˙i is the one-hot vector representing the ground-truth class, y^i is the softmax output of the model, and y^i,c and yi,c are the c-th dimensions of y^i and y˙i, respectively.
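As a concrete illustration of Equation (2.3), the loss can be computed directly from the softmax output and the one-hot ground truth. The following is a minimal NumPy sketch; the logit values are made up for illustration:

```python
import numpy as np

def softmax(logits):
    # Shift by the max for numerical stability before exponentiating.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(y_hat, y_onehot):
    # Equation (2.3): only the ground-truth class contributes to the sum,
    # because the one-hot vector is zero everywhere else.
    eps = 1e-12  # guards against log(0)
    return -np.sum(y_onehot * np.log(y_hat + eps))

logits = np.array([2.0, 1.0, 0.1])    # raw model outputs for C = 3 classes
y_hat = softmax(logits)               # softmax output of the model
y_onehot = np.array([1.0, 0.0, 0.0])  # ground truth is class 0
loss = cross_entropy(y_hat, y_onehot)
```

Since the one-hot vector zeroes out all other terms, the loss reduces to the negative log-probability the model assigns to the correct class.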

The resulting minimization problem is then typically solved with iterative gradient-based optimization algorithms, e.g., SGD [155] or ADAM [98], which perform on par with recent methods on the non-convex, continuous loss surfaces produced by modern machine learning problems and architectures [163].

On top of that, the model predictor fθ can be decomposed into two functions as fθ = w ∘ ϕ, where ϕ: 𝒳 → 𝒵 is an embedding into a feature space, hence sometimes called the feature extractor, and w: 𝒵 → 𝒴 is called the classifier since it predicts from the feature space to the output space [73, 131]. This often allows for a more concise mathematical notation.
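To make the decomposition fθ = w ∘ ϕ concrete, the sketch below shows the feature extractor and classifier as separate functions. The layer sizes are hypothetical, and a single linear layer with a ReLU stands in for a deep convolutional backbone:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: flattened 8x8 grayscale inputs (m = 64), a
# 16-dimensional feature space Z, and C = 4 output classes.
m, k, C = 64, 16, 4
W_phi = rng.normal(size=(k, m))  # parameters of the feature extractor phi
W_cls = rng.normal(size=(C, k))  # parameters of the classifier w

def phi(x):
    # Feature extractor phi: X -> Z (linear layer + ReLU).
    return np.maximum(W_phi @ x, 0.0)

def w(z):
    # Classifier w: Z -> Y, producing one logit per class.
    return W_cls @ z

def f_theta(x):
    # Full predictor: the composition f_theta = w . phi.
    return w(phi(x))

x = rng.normal(size=m)
y_hat = int(np.argmax(f_theta(x)))  # predicted class index
```

Splitting the predictor this way is also how later chapters talk about masking features in 𝒵 between the two stages.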

Domain Generalization The problem of domain generalization (DG) builds on top of this framework. We now have a set of training environments Ξ = {ξ1, …, ξs}, also known as source domains, where each environment ξ has an associated dataset Dξ = {(xiξ, yiξ)}i=1nξ containing nξ i.i.d. samples from an individual data distribution 𝒟ξ. Note that, while related, the environments have different joint distributions, i.e. 𝒟ξi ≠ 𝒟ξj ∀ i ≠ j. Here, xiξ ∈ ℝm is the i-th sample of environment ξ, representing an m-dimensional feature vector (i.e. an image in our case), and yiξ ∈ 𝒴 is the corresponding ground-truth class label over the C possible classes. The one-hot vector representing the ground truth is denoted as y˙iξ. To simplify notation, we sometimes omit ξ where it is obvious. From these source domains, we try to learn generic feature representations that are agnostic to domain changes in order to improve model performance [166]. Put simply, we attempt out-of-distribution generalization: our model aims to achieve good performance on an unseen test environment ξt sampled from the set of unseen environments Φ = {ξ1, …, ξt} with Ξ ∩ Φ = ∅, based on statistical invariances across the observed training (source) and testing (target) domains [73, 86]. For that, we try to minimize the expected target risk of our model:

(2.4) R(fθ) = E(xiξt,yiξt)∼𝒟ξt[L(fθ(xiξt), yiξt)].

Since we do not have access to 𝒟ξt during training, one simple approach is to assume that minimizing the risk over all source domains in Ξ achieves good generalization to the target domain. That is, we disregard the separate environments:

(2.5) R(fθ) = E(xi,yi)∼∪ξ∈Ξ𝒟ξ[L(fθ(xi), yi)].

Again, this can be written with the empirical risk as a simple sum over all environments and their corresponding samples:

(2.6) Rerm(fθ) = (1/s) ∑ξ∈Ξ (1/nξ) ∑i=1nξ L(fθ(xiξ), yiξ).
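A minimal sketch of Equation (2.6), using made-up per-sample loss values for three source domains of different sizes:

```python
import numpy as np

def erm_risk(per_domain_losses):
    # Equation (2.6): average the per-domain mean losses, i.e.
    # (1/s) * sum over xi of (1/n_xi) * sum over i of L(f(x_i), y_i).
    return sum(np.mean(l) for l in per_domain_losses) / len(per_domain_losses)

# Hypothetical per-sample loss values for s = 3 source domains
# with n = 2, 3, and 1 samples respectively.
losses = [np.array([0.2, 0.4]),
          np.array([0.1, 0.3, 0.5]),
          np.array([0.6])]
risk = erm_risk(losses)
```

Note that Equation (2.6) weights every environment equally regardless of its size; naively pooling all samples into one dataset would instead weight larger domains more heavily.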

The difference between this approach and ordinary supervised learning is shown at a high level in Table 2.1. It may also help to think of a meta-distribution D (the real world) generating source environment distributions 𝒟ξ∈Ξ and unseen testing domain distributions 𝒟ξ∈Φ, as shown in Figure 2.1.

Figure 2.1: Meta-distribution D generating source environment distributions (left) and unseen environment distributions (right), adapted from [5].

Setup                           Training inputs                  Testing inputs
Generative learning             Uξ1                              0
Unsupervised learning           Uξ1                              Uξ1
Supervised learning             Lξ1                              Uξ1
Semi-supervised learning        Lξ1, Uξ1                         Uξ1
Multitask learning              Lξ1, …, Lξs                      Uξ1, …, Uξs
Continual (lifelong) learning   Lξ1, …, Lξ∞                      Uξ1, …, Uξ∞
Domain Adaptation               Lξ1, …, Lξs, Uξt                 Uξt
Transfer learning               Uξ1, …, Uξs, Lξt                 Uξt
Domain generalization           Lξ1, …, Lξs                      Uξt

Table 2.1: Differences in learning setups, adapted from [73]. For each environment ξ, the labeled and unlabeled datasets are denoted as Lξ and Uξ, respectively.

Homogeneous and Heterogeneous Sometimes, domain generalization is further divided into homogeneous and heterogeneous subtasks. In homogeneous DG, we assume that all domains share the same label space, 𝒴ξi = 𝒴ξj = 𝒴ξt ∀ ξi ≠ ξj ∈ Ξ. In contrast, the more challenging heterogeneous DG allows for different label spaces, 𝒴ξi ≠ 𝒴ξj ≠ 𝒴ξt for ξi ≠ ξj ∈ Ξ, which can even be completely disjoint [110]. For this work, we assume a homogeneous setting.

Single- and Multi-Source There exist subtle differences within this task which are called single- or multi-source domain generalization. While multi-source domain generalization refers to the standard-setting we have just outlined, single-source domain generalization is a more generic formulation [221]. Instead of relying on multiple training domains to learn models which generalize better, single-source domain generalization aims at learning these representations with access to only one source distribution. Hence,our training domains are restricted to Ξ={ξ1} ,described by one dataset Dξ1={(xiξ1,yiξ1)}i=1nξ1 and modeling a single source distribution Dξ1 . For example,this can be achieved by combining the different datasets or distributions similar to Equation (2.5) or mathematically ξΞDξ . This is different from the ordinary supervised learning setup since we want to analyze the performance of the model under a clear domain-shift (i.e. out-of-distribution generalization). Keep in mind, that strong regularization methods will also perform well on this subtask. These cross-over and related techniques are described in the following section.

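To make the single-source construction concrete, the union ξΞDξ amounts to pooling the per-domain datasets into one. The following is a minimal NumPy sketch under the simplifying assumption that each domain dataset is an (inputs, labels) array pair; the helper name is our own illustration, not part of [221]:

```python
import numpy as np

def pool_source_domains(domain_datasets):
    """Merge the per-domain datasets into one dataset describing a
    single source distribution, i.e. the union over all training domains."""
    xs = np.concatenate([x for x, _ in domain_datasets], axis=0)
    ys = np.concatenate([y for _, y in domain_datasets], axis=0)
    return xs, ys

# Two toy "domains" with 3 and 2 samples of 4 features each.
d1 = (np.zeros((3, 4)), np.array([0, 1, 0]))
d2 = (np.ones((2, 4)), np.array([1, 1]))
x, y = pool_source_domains([d1, d2])  # x: (5, 4), y: (5,)
```

The resulting pooled dataset can then be fed to any single-source method, discarding the domain identity of each sample.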
2.2 Related concepts and their differences

As already introduced, members of the causality community might know the task of domain generalization under the term learning from multiple environments [9,73,143], and researchers coming from deep learning might know it as learning from multiple domains. While these two concepts refer to the same task, there exist quite a few related techniques which we want to highlight here and distinguish in their scope. In particular, we focus on "Generic neural network regularization" and "Domain Adaptation", since each of these is very closely related to domain generalization and sometimes hard to tell apart. The overview in Table 2.1, however, includes even more learning setups to properly position this concept in the machine learning landscape.

2.2.1 Generic Neural Network Regularization

In theory, generic model regularization which aims to prevent neural networks from overfitting on the source domain could also improve the domain generalization performance [86]. As such, methods like dropout [177], early stopping [30], or weight decay [136] can have a positive effect on this task when deployed properly. Apart from regular dropout, where we randomly disable neurons in the training phase to stop them from co-adapting too much, a few alternative methods exist. These include dropping random patches of input images (Cutout & HaS) [40, 172] or channels of the feature map (SpatialDropout) [185], dropping contiguous regions of the feature maps (DropBlock) [64], dropping features of high activations across feature maps and channels (MaxDrop) [139], or generalizing the traditional dropout of single units to entire layers during training (DropPath) [104]. There even exist methods like curriculum dropout [130] that deploy a schedule for the dropout probability and therefore softly increase the number of units to be suppressed layerwise during training.

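As an illustration of such input-level dropout, a Cutout-style patch removal can be sketched in a few lines. This is a hedged NumPy sketch that follows the general idea of [40], not the exact reference implementation; the function name and random placement policy are our own choices:

```python
import numpy as np

def cutout(image, patch_size, rng=None):
    """Return a copy of an H x W (x C) image with one randomly placed
    square patch of side `patch_size` set to zero."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = image.shape[:2]
    # Sample the top-left corner so the patch always fits in the image.
    top = int(rng.integers(0, h - patch_size + 1))
    left = int(rng.integers(0, w - patch_size + 1))
    out = image.copy()
    out[top:top + patch_size, left:left + patch_size] = 0.0
    return out

augmented = cutout(np.ones((8, 8, 3)), patch_size=4, rng=np.random.default_rng(0))
```

During training, the augmentation would be applied independently to each sample of a batch, forcing the network to rely on a wider set of image regions.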
Generally, deploying some of these methods when aiming for out-of-distribution generalization can be a good idea and should definitely be considered for the task of domain generalization.

2.2.2 Domain Adaptation

Domain Adaptation (DA) is often mentioned as a closely related task in the domain generalization literature [131,148,192]. When compared, domain adaptation has additional access to an unlabeled dataset from the target domain [34,123]. Formally, aside from the set of source domains Ξ and the domain datasets Dξ, as outlined in Section 2.1, we have access to target samples Uξt={x1ξt,,xnξtξt} drawn from the target domain xiξtDξt, but their labels remain unknown during training since we want to predict them during testing. As a result, domain generalization is considered to be the harder problem of the two. This difference is also shown in Table 2.1.

Earlier methods in this space deploy hand-crafted features to reduce the difference between the source and the target domains [126]. For example, instance-based methods re-weight source samples according to target similarity [68,85,202], while feature-based methods learn a common subspace [14,54,69,119]. More recent works focus on deep domain adaptation based on deep architectures where domain-invariant features are learned utilizing supervised neural networks [22, 29, 60, 67], autoencoders [209], or generative adversarial networks (GANs) [21, 169, 186]. These deep NN-based architectures significantly outperform the approaches based on hand-crafted features [126].

Even though domain adaptation and domain generalization both try to reduce the dataset bias, they are not compatible with each other [65]. Hence, domain adaptation methods often cannot be directly used for domain generalization or vice versa [65]. For this work, we do not rely on the simplifying assumptions of domain adaptation but instead tackle the more challenging task of DG.

2.3 Previous Works

Most commonly, work in domain generalization can be divided into methods that try to learn invariant features, combine domain-specific models in a process called model ensembling, pursue meta-learning, or utilize data augmentation to generate new domains or more robust representations. Since literature in the domain generalization space is broad, we utilized Gulrajani and Lopez-Paz [73, Appendix A] for an overview and to identify relevant literature while individually adding additional works and more detailed information where necessary.

2.3.1 Learning invariant features

Methods that try to learn invariant features typically minimize the difference between source domains. They assume that with this approach the features will be domain-invariant and therefore will have good performance for unseen testing domains [86].

Some of the earliest works on learning invariant features were kernel methods applied by Muandet, Balduzzi, and Schölkopf [132], who experimented with a feature transformation that minimizes the across-domain dissimilarity between transformed feature distributions while preserving the functional relationship between original features and targets. In recent years, there have been approaches following a similar kernel-based approach [115, 116], sometimes while maximizing class separability [65,83]. As an early method, Fang, Xu, and Rockmore [52] introduce Unbiased Metric Learning (UML) with an SVM metric that enforces the neighborhood of samples to contain samples with the same class label but from other training domains.

After that, Ganin et al. [61] introduced Domain Adversarial Neural Networks (DANNs), using neural network architectures to learn domain-invariant feature representations by adding a gradient reversal layer. Recently, their approach was extended to support statistical dependence between domains and class labels [2] or to consider one-versus-all adversaries that minimize pairwise divergences between source distributions [5]. Motiian et al. [131] use a siamese architecture to learn a feature transformation that tries to achieve semantic alignment of visual domains while maximally separating them. Other methods match the feature covariance across source domains [150] or take a causal interpretation to match representations of features [122]. Huang et al. [86] have also shown that self-challenging (i.e. dropping features with high gradient values at each epoch) works very well.

Matsuura and Harada [127] use clustering techniques to split single-source domain generalization into different domains and then train a domain-invariant feature extractor via adversarial learning. Other works have also deployed similar approaches based on adversarial strategies [37, 92].

Li et al. [112] deploy adversarial autoencoders with maximum mean discrepancy (MMD) [72] to align the source distributions, i.e. for distributions Dξ1,Dξ2 and a feature map φ:XH, where H is a reproducing kernel Hilbert space (RKHS), this measure is defined as Equation (2.7).

(2.7)MMD(Dξ1,Dξ2)=E(xi,yi)Dξ1[φ(xiξ1)]E(xi,yi)Dξ2[φ(xiξ2)]H
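For the special case of a linear kernel, i.e. the identity feature map φ(x)=x, the RKHS norm in Equation (2.7) reduces to the Euclidean distance between the two empirical feature means. The sketch below illustrates this simplified case only; [112] use richer kernels in practice:

```python
import numpy as np

def mmd_linear(x1, x2):
    """Empirical MMD between two samples for the identity feature map:
    the Euclidean distance between the per-domain feature means."""
    return float(np.linalg.norm(x1.mean(axis=0) - x2.mean(axis=0)))

# Identical samples have zero MMD; shifting one domain's mean raises it.
a = np.zeros((10, 3))
b = np.ones((10, 3))
```

Minimizing such a statistic over the source domains pushes their (embedded) feature distributions towards each other.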

Ilse et al. [88] extend the variational autoencoder [99] by introducing latent representations for environments Zξ, classes Zy, and residual variations Zx. Further, Li et al. [110] use episodic training, i.e. they train a domain-agnostic feature extractor ϕ and classifier w by mismatching them with equivalents trained on a specific domain, ϕξ and wξ, in combinations (ϕξ1,wξ2,xiξ2) and (ϕξ2,wξ1,xiξ2), and letting them predict data outside of the trained domain ξ1ξ2. Piratla, Netrapalli, and Sarawagi [145] also learn domain-specific and common components, but the domain-specific parts are discarded after training. Li et al. [109] deploy a lifelong sequential learning strategy.

2.3.2 Model ensembling

Some methods try to associate model parameters with each of the training domains and combine them, often together with shared parameters, in a meaningful manner to improve generalization to the test domain. Commonly, the number of models in these types of architectures grows linearly with the number of source domains.

The first work to pose the problem of domain generalization and analyze it was Blanchard, Lee, and Scott [20]. There, they use classifiers for each sample xξ denoted as fθ(xξ,μξ), where μξ corresponds to a kernel mean embedding [133]. For theoretical analyses of such methods, see Deshmukh et al. [38] and Blanchard et al. [19]. Later on, Khosla et al. [96] combine global weights θ with local domain biases Δξ to learn one max-margin linear classifier (SVM) per domain as θξ=θ+Δξ and finally combine them; this has recently been extended to neural network settings by adding an additional dimension describing the training domains to the parameter tensors [107]. Ghifary et al. [66] propose a Multi-task Autoencoder (MTAE) with parameters shared across the hidden state and domain-specific parameters for each of the training domains. Further, Mancini et al. [125] use domain-specific batch-normalization [89] layers and then linearly combine them using a softmax domain classifier. Other works utilize other domain-specific normalization techniques [166], linearly combine domain-specific predictors [124], or use more elaborate aggregation strategies [35]. Ding and Fu [42] use multiple domain-specific deep neural networks with a structured low-rank constraint and a domain-invariant deep neural network to generalize to the target domain. There have also been works that assign weights to mini-batches depending on their respective error on the training distributions [84, 159]. Jin et al. [94] use attention mechanisms to align the features of the different training domains.

2.3.3 Meta-learning

Meta-learning approaches provide algorithms that tackle the problem of learning to learn [161, 183]. As such, Finn, Abbeel, and Levine [56] propose a Model-Agnostic Meta-Learning (MAML) algorithm that can quickly learn new tasks with fine-tuning. Li et al. [108] adapt this algorithm for domain generalization (no fine-tuning) such that we can adapt to new domains by utilizing a meta-optimization objective which ensures that steps to improve training domain performance should also improve testing domain performance. Both approaches are not bound to a specific architecture and can therefore be deployed for a wide variety of learning tasks. These approaches recently got extended by two regularizers that encourage general knowledge about inter-class relationships and domain-independent class-specific cohesion [46], to heterogeneous domain generalization [117], or via meta-learning a regularizer that encourages across-domain performance [15].

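The meta-optimization objective of Li et al. [108] can be sketched for any simple differentiable model: hold some source domains out as meta-test, take a virtual gradient step on the meta-train domains, and update with the combination of both gradients. The snippet below is our own first-order illustration for a linear least-squares model, not the exact second-order algorithm from the paper:

```python
import numpy as np

def squared_loss_grad(w, x, y):
    """Mean squared error of a linear model and its gradient."""
    err = x @ w - y
    return float((err ** 2).mean()), 2 * x.T @ err / len(y)

def mldg_step(w, meta_train, meta_test, alpha=0.01, lr=0.01):
    """One meta-update: the meta-test gradient is evaluated AFTER a
    virtual inner step on meta-train, so the update favours directions
    that improve both the training and the held-out domains."""
    _, g_train = squared_loss_grad(w, *meta_train)
    w_virtual = w - alpha * g_train              # virtual inner step
    _, g_test = squared_loss_grad(w_virtual, *meta_test)
    return w - lr * (g_train + g_test)           # combined meta-update

x, y = np.eye(2), np.ones(2)
w0 = np.zeros(2)
w1 = mldg_step(w0, (x, y), (x, y))
```

In the full algorithm the meta-train/meta-test split is re-sampled from the source domains at every iteration, simulating domain shift during training.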
2.3.4 Data Augmentation

Data Augmentation remains a competitive method for generalizing to unseen domains [211]. Works in this segment try to extend the source environments to a wider range of domains by augmenting the available training environments. However, to deploy an efficient procedure for that, human experts need to consider the data at hand to develop a useful routine [73].

Several works have used the MIXUP [210] algorithm as a method to merge samples from different domains [123,195,200,203]. Other works have tried removing textural information from images [194] or shifting the learned representation more towards shapes [10, 135]. Carlucci et al. [28] used jigsaw puzzles of image patches as a classification task to show that this improves domain generalization, while Volpi et al. [193] demonstrate that adversarial data augmentation on a single domain is sufficient. Further, Volpi and Murino [192] use popular image transformations (e.g. brightness, contrast, sharpness) with different intensity levels to train a more robust model, and Somavarapu, Ma, and Kira [174] use other stylizing techniques. Several methods also use GANs to augment the available training data [151,167,219] or use other methods to generate synthetic domains [218]. Qiao, Zhao, and Peng [148] deploy an adversarial domain augmentation approach using a Wasserstein Auto-Encoder [184].

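The MIXUP operation itself is just a convex combination of two samples and their one-hot labels; pairing samples from two different source domains can be sketched as follows (a minimal sketch — the cross-domain pairing strategy varies between the cited works):

```python
import numpy as np

def mixup(x_a, y_a, x_b, y_b, alpha=0.2, rng=None):
    """Interpolate two batches (e.g. from different source domains) with
    a Beta(alpha, alpha)-distributed mixing coefficient."""
    if rng is None:
        rng = np.random.default_rng()
    lam = float(rng.beta(alpha, alpha))
    # The same coefficient mixes both the inputs and the one-hot labels.
    return lam * x_a + (1.0 - lam) * x_b, lam * y_a + (1.0 - lam) * y_b

x_mix, y_mix = mixup(np.zeros((4, 8)), np.array([1.0, 0.0]) * np.ones((4, 2)),
                     np.ones((4, 8)), np.array([0.0, 1.0]) * np.ones((4, 2)),
                     rng=np.random.default_rng(0))
```

Because the mixed labels stay normalized, the standard cross-entropy loss can be used on the interpolated targets without modification.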
2.4 Common Datasets

There exist several datasets that are commonly used in domain generalization research. Here, we introduce the most popular choices as well as further interesting datasets to consider. We give an overview of Rotated MNIST, Colored MNIST, Office-Home, VLCS, PACS, Terra Incognita, DomainNet, and ImageNet-C. Currently, the most popular choices include PACS, VLCS, and Office-Home.

2.4.1 Rotated MNIST

The Rotated MNIST (RMNIST) dataset [66] is a variation of the original MNIST dataset [105] where each digit is rotated by {0,15,30,45,60,75} degrees. Each rotation angle represents one domain, as shown in Table 2.2 for the classes "2" and "4". The overall dataset in Gulrajani and Lopez-Paz [73] includes 70000 images from 10 homogeneous classes (0-9), each with dimension 1×28×28.

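Building the RMNIST domains only requires rotating each digit by the domain's angle. A self-contained nearest-neighbour rotation can be sketched as below; this is our own minimal resampling sketch, whereas [66] and [73] rely on standard image libraries:

```python
import numpy as np

def rotate_image(img, degrees):
    """Rotate a square H x W image around its centre using
    nearest-neighbour resampling (inverse mapping)."""
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    theta = np.deg2rad(degrees)
    ys, xs = np.mgrid[0:h, 0:w]
    # For every target pixel, look up the source pixel it came from.
    sy = cy + (ys - cy) * np.cos(theta) - (xs - cx) * np.sin(theta)
    sx = cx + (ys - cy) * np.sin(theta) + (xs - cx) * np.cos(theta)
    sy = np.clip(np.rint(sy).astype(int), 0, h - 1)
    sx = np.clip(np.rint(sx).astype(int), 0, w - 1)
    return img[sy, sx]

digit = np.random.default_rng(0).random((28, 28))
domains = {d: rotate_image(digit, d) for d in (0, 15, 30, 45, 60, 75)}
```

Applying this to every MNIST image yields one dataset per angle, each treated as a separate domain.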
2.4.2 Colored MNIST

The Colored MNIST (CMNIST) dataset [9] is another variation of the original MNIST dataset [105]. The grayscale images of MNIST are colored in red and green. The respective label corresponds to a combination of digit and color, where the color correlates with the class label with factors {0.1,0.2,0.9} as domains and the digit has a constant correlation of 0.75. Since this is a synthetic dataset, the factors can easily be adapted or extended to more domains. We report these numbers since they are used in Arjovsky et al. [9] and Gulrajani and Lopez-Paz [73]. Since the correlation factor between color and label varies between domains, this dataset is well-suited for determining a model's capability of removing color as a predictive feature [9].

To construct the dataset, Arjovsky et al. [9] first assign an initial label y~=0 for digits 04 and y~=1 for digits 59. This initial label is then flipped with a probability of 25% to obtain the final label y. Finally, we obtain the color z by flipping the label y with probability pd{0.1,0.2,0.9} depending on the domain. The image is then colored red for z=1 or green for z=0 [9]. Samples for both classes across domains can be seen in Table 2.2.

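This construction can be sketched per image. The snippet below is a minimal sketch of the sampling process of Arjovsky et al. [9]; the helper name and the (red, green) channel ordering are our own choices:

```python
import numpy as np

def make_cmnist_sample(image, digit, p_d, rng):
    """Turn one grayscale MNIST image into a CMNIST sample: binary label
    from the digit with 25% label noise, then a colour bit z that
    disagrees with the label with domain-dependent probability p_d."""
    y_tilde = 0 if digit <= 4 else 1               # initial label
    y = y_tilde if rng.random() >= 0.25 else 1 - y_tilde
    z = y if rng.random() >= p_d else 1 - y        # colour bit
    zeros = np.zeros_like(image)
    channels = (image, zeros) if z == 1 else (zeros, image)  # (red, green)
    return np.stack(channels), y                   # shape (2, 28, 28)

sample, label = make_cmnist_sample(np.ones((28, 28)), digit=3, p_d=0.1,
                                   rng=np.random.default_rng(0))
```

With p_d = 0.1 or 0.2 the colour is a highly predictive but spurious feature, while p_d = 0.9 in the test domain inverts the correlation.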
Overall, the dataset in Gulrajani and Lopez-Paz [73] contains 70000 images from 2 homogeneous classes, each image of dimension 2×28×28.

2.4.3 Office-Home

The Office-Home dataset [190] provides 15588 images from 65 categories across 4 domains. The domains include Art, Clipart, Products (objects without a background), and Real-World (captured with a regular camera). Samples from these domains for the classes "alarm-clock" and "bed" can be seen in Table 2.2. On average, each class contains around 70 images, with a maximum of 99 images in a category [190]. Gulrajani and Lopez-Paz [73] use dimension 3×224×224 for each image.

Table 2.2: Samples for two different classes across domains for popular datasets

2.4.4 VLCS

The VLCS dataset [52] utilizes photographic datasets as individual domains. As such, it contains the domains PASCAL VOC (V) [51], LabelMe (L) [158], Caltech101 (C) [111], and SUN09 (S) [33]. In total, there are 10729 images from 5 classes. Samples for the classes "bird" and "car" can be seen in Table 2.2. Gulrajani and Lopez-Paz [73] use dimension 3×224×224 for each image.

2.4.5 PACS

The PACS dataset [107] consists of images from different domains including Photo (P), Art (A), Cartoon (C), and Sketch (S) as individual domains. As such, it extends the previously photo-dominated data sets in domain generalization [107]. It includes seven homogeneous classes (dog, elephant, giraffe, guitar, horse, house, person) across the four previously mentioned domains. Table 2.2 shows samples from the "dog" and "elephant" classes across all domains.

In total, PACS contains 9991 images, obtained by intersecting the classes of Caltech256 (Photo), Sketchy (Sketch) [160], TU-Berlin (Sketch) [49], and Google Images (all but Sketch) [107].

2.4.6 Terra Incognita

The Terra Incognita dataset is a subset of the initial Caltech camera traps dataset proposed by Beery, Horn, and Perona [17]. It contains photographs of wild animals taken by camera traps at different locations (IDs: 38, 43, 46, 100), which represent the domains. As such, the version used by Gulrajani and Lopez-Paz [73] contains 24788 images of 10 classes, each with size 3×224×224. Samples from two different classes can be seen in Table 2.2. The chosen locations are the four locations with the largest number of images, each with more than 4000 images.

The main data challenges in this dataset include illumination, motion blur, size of the region of interest, occlusion, camouflage, and perspective [17]. For example, the animals are not always salient, or they are small or far from the camera, so that only partial views of the animals' bodies are available [17].

2.4.7 DomainNet

The DomainNet dataset [141] contains six domains: clipart (48129 images), infographic (51605 images), painting (72266 images), quickdraw (172500 images), real (172947 images), and sketch (69128 images) for 345 classes. In total, it contains 586575 images, accumulated by searching for each category together with a domain name in multiple image search engines and, as an exception, collected from players of the game "Quick Draw!" for the quickdraw domain [141].

To secure the quality of the dataset, they hired 20 annotators for a total of 2500 hours to filter out falsely labeled images [141]. Each category has an average of 150 images for the domains clipart and infographic, 220 images for painting and sketch, and 510 for the real domain [141].

2.4.8 ImageNet-C

The ImageNet-C dataset [80] contains images from ImageNet [157] perturbed according to 15 corruption types, each with 5 levels of severity, which results in 75 domains. The corruption types fall into the categories "noise", "blur", "weather", and "digital" [80]. Table 2.2 shows a few of the available corruptions at severity 3 for the same image sample of two different classes. The provided corruptions are Gaussian Noise, Shot Noise, Impulse Noise, Defocus Blur, Frosted Glass Blur, Motion Blur, Zoom Blur, Snow, Frost, Fog, Brightness, Contrast, Elastic, Pixelate, and JPEG Compression [80]. Overall, the dataset provides all 1000 ImageNet classes, where each image has the standard dimension of 3×224×224. Currently, this dataset is not implemented by Gulrajani and Lopez-Paz [73].

2.5 Considerations regarding model validation

In traditional supervised learning setups, we train our model on the training dataset, validate its hyperparameters (e.g. number of layers or hidden units in a neural network) on a separate validation dataset, and finally evaluate our model on an unused test dataset. Notice that the validation dataset should be distributed identically to the test data to properly fit the hyperparameters of our architecture. This is not a straightforward process for domain generalization, as we lack a proper validation dataset with the needed statistical properties. There exist several approaches to this problem, some of them being more grounded than others. Here we give an overview of the three approaches outlined by Gulrajani and Lopez-Paz [73], which are used in their DG framework DOMAINBED.

Training-domain validation set In this approach, each training domain is further split into training and validation subsets, and all validation subsets across domains are pooled into one global validation set. We can then maximize the model's performance on that global validation set to set the hyperparameters. This approach assumes that the training and test distributions are similar.

Leave-one-domain-out cross-validation We can train s models with equal hyperparameters on the s training domains, each time holding one of the domains out of training. This allows us to validate on the held-out domain and average the results to calculate the global held-out-domain accuracy. Based on that, we can choose a hyperparameter setting and re-train the model on all of the training domains.

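This procedure can be sketched as a generic loop over the source domains. The snippet below is an illustrative sketch only; `train_fn` and `eval_fn` stand in for whatever training and evaluation routine is used:

```python
def leave_one_domain_out(domains, train_fn, eval_fn, hparams):
    """Train one model per held-out domain (same hyperparameters) and
    return the average held-out-domain score used for model selection."""
    scores = []
    for held_out in domains:
        train_domains = [d for d in domains if d != held_out]
        model = train_fn(train_domains, hparams)
        scores.append(eval_fn(model, held_out))
    return sum(scores) / len(scores)

# Dummy check: a "model" that memorises its training domains never
# sees the held-out domain during training.
score = leave_one_domain_out(
    ["art", "cartoon", "sketch"],
    train_fn=lambda doms, hp: set(doms),
    eval_fn=lambda model, dom: 0.0 if dom in model else 1.0,
    hparams={},
)
```

The hyperparameter setting with the highest average score is then selected and retrained on all source domains.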
Test-domain validation set (oracle) A statistically biased way of validating the model's hyperparameters is to incorporate the test dataset as a validation dataset. Because of this, it is considered bad practice and should be avoided or at least explicitly marked. One possible mitigation is to restrict access to the test dataset, as done by Gulrajani and Lopez-Paz [73], who prohibit early stopping and only use the last checkpoint.

Other works have also come up with alternative methods to choose the hyperparameters. For example, Krueger et al. [101] validate the hyperparameters on all domains of the VLCS dataset and then apply the settings to PACS, while D'Innocente and Caputo [35] use a validation technique that combines probabilities specific to their method.

2.6 Deep-Dive into Representation Self-Challenging

Since some of our proposed methods use ideas from Representation Self-Challenging (RSC) [86], we explain their approach in more detail here. They deploy two RSC variants, called Spatial-Wise RSC and Channel-Wise RSC, which they randomly alternate between. Both are shown in Algorithm 1 and operate on the features after the last convolutional layer.

First, RSC calculates the gradient of the upper layer with respect to the latent feature representation according to Equation (2.8). Here, is the element-wise product and y˙ is the one-hot encoding of the ground truth.

(2.8)gz=(w(z)y˙)z

Afterward, they average-pool the gradients to obtain g~z. The key difference between Spatial-Wise and Channel-Wise RSC lies in the average pooling done to compute g~z in line 5 and the duplication in line 6 of Algorithm 1. While for Spatial-Wise RSC average pooling is done over the channel dimension according to Equation (2.9), yielding g~zRHz×Wz×1 with one entry g~z,i,j per spatial location (i,j), in Channel-Wise RSC the same computation is done over the spatial dimensions with Equation (2.10), yielding g~zR1×1×K, a vector with the size of the feature map count.

(2.9)g~z,i,j=1Kk=1Kgz,i,jk

(2.10)g~z=1HzWzi=1Hzj=1Wzgz,i,j

Depending on which dimensions are missing to get back to the original size of z, gzRHz×Wz×K, the computed values get duplicated along these dimensions. In the case of Spatial-Wise RSC these are the channels, while for Channel-Wise RSC these are the spatial dimensions.

Next, Huang et al. [86] compute the (100 - p)th percentile as the threshold value qp and compute the mask mi,j for spatial location (i,j) based on Equation (2.11). This mask is set to 0 for the corresponding top-p percent of elements in g~z and therefore has the same shape.

(2.11)mi,j={0, if g~z,i,jqp1, otherwise 

Huang et al. [86] apply the computed mask to the feature representation to yield z~p=zm, which they validate using Equation (2.12). This computes the difference in the correct class probability with and without the masking and yields a difference score for each sample in the vector c.

(2.12)c=j=1C(softmax(w(z))y˙softmax(w(z~))y˙)j

A positive value represents that the masking for that sample made the classifier less certain about the correct class, while a negative value represents the opposite, i.e. the masking made the classifier more certain about the correct class. Similar to before, Huang et al. [86] calculate the top-p percentile of the positive values with threshold bp. According to Equation (2.13), they then revert the whole masking for all samples inside each batch that fall outside this top-p set: each spatial location (i,j) of the mask associated with sample n gets set back to 1 if the condition applies; otherwise the mask values remain unchanged.

正值表示该样本的掩码使分类器对正确类别的置信度降低,负值则表示相反,使分类器对正确类别的置信度提高。与之前类似,Huang 等人[86]计算正值的前\( p \)百分位数作为阈值\( b_p \)。他们根据公式(2.13)对每个批次中低于该阈值的所有样本整体还原掩码,其中样本\( n \)对应掩码的每个空间位置\( (i,j) \)若满足条件则设回1,否则掩码值保持不变。

(2.13)\( \mathbf{m}_{i,j}^{n} = \begin{cases} 1, & \text{if } c_n \leq b_p \\ \mathbf{m}_{i,j}^{n}, & \text{otherwise} \end{cases} \)

Finally, we mask the features with the obtained final mask to obtain \( \widetilde{\mathbf{z}} = \mathbf{z} \odot \mathbf{m} \), compute the loss \( \mathcal{L}(w(\widetilde{\mathbf{z}}), \mathbf{y}) \), and backpropagate through the whole network.

最后,我们用得到的最终掩码对特征进行掩码处理以获得\( \widetilde{\mathbf{z}} = \mathbf{z} \odot \mathbf{m} \),计算损失\( \mathcal{L}(w(\widetilde{\mathbf{z}}), \mathbf{y}) \)并对整个网络进行反向传播。

Problems Interestingly, since many architectures like ResNet-18/ResNet-50 deploy average pooling over the spatial dimensions after the last convolutional layer in their forward pass, naïve Spatial-Wise RSC doesn't make sense: the subsequent spatial pooling spreads the feature values evenly regardless of the spatial masking. Even though this isn't mentioned in their paper, Huang et al. [86] address this issue in their official repository and propose an alternative computation. For that, they calculate the mean

问题 有趣的是,由于许多架构如ResNet-18/ResNet-50在前向传播中于最后一个卷积层后对空间维度进行平均池化,简单的空间维度RSC没有意义:后续的空间平均池化会使特征值均匀分布,与空间掩码无关。虽然论文中未提及此问题,但他们在官方代码库中对此进行了处理并提出了替代计算方法。为此,他们计算了均值

Algorithm 1: Spatial- and Channel-Wise RSC

算法1:空间和通道维度的RSC


Input: Data \( \mathbf{X}, \mathbf{Y} \) with \( \mathbf{x}_i \in \mathbb{R}^{H \times W \times 3} \), drop factor \( p \), epochs \( T \)

输入:数据\( \mathbf{X}, \mathbf{Y} \),其中\( \mathbf{x}_i \in \mathbb{R}^{H \times W \times 3} \),丢弃因子\( p \),训练轮数\( T \)

while epoch \( \leq T \) do

当训练轮数\( \leq T \)时执行

for every sample (or batch) \( \mathbf{x},\mathbf{y} \) do
对每个样本(或批次)\( \mathbf{x},\mathbf{y} \)执行
	Extract features \( \mathbf{z} = \phi \left( \mathbf{x}\right) \) // \( \mathbf{z} \) has shape \( {\mathbb{R}}^{{H}_{\mathbf{z}} \times  {W}_{\mathbf{z}} \times  K} \)
	提取特征\( \mathbf{z} = \phi \left( \mathbf{x}\right) \) // \( \mathbf{z} \)的形状为\( {\mathbb{R}}^{{H}_{\mathbf{z}} \times  {W}_{\mathbf{z}} \times  K} \)
	Compute gradient \( {\mathbf{g}}_{\mathbf{z}} \) w.r.t features according to Equation (2.8)
	根据公式(2.8)计算特征的梯度\( {\mathbf{g}}_{\mathbf{z}} \)
	Compute \( {\widetilde{\mathbf{g}}}_{\mathbf{z}} \) by avg. pooling, with probability \( {50}\% \) using Equation (2.9) and with probability \( {50}\% \) using Equation (2.10)
	以各\( {50}\% \)的概率选择公式(2.9)或公式(2.10),通过平均池化计算\( {\widetilde{\mathbf{g}}}_{\mathbf{z}} \)
	Duplicate \( {\widetilde{\mathbf{g}}}_{\mathbf{z}} \) along channel/spatial dimension for initial shape
	沿通道/空间维度复制\( {\widetilde{\mathbf{g}}}_{\mathbf{z}} \)以恢复初始形状
	Compute mask \( {\mathbf{m}}_{i,j} \) according to Equation (2.11)
	根据公式(2.11)计算掩码\( {\mathbf{m}}_{i,j} \)
	Mask features to obtain \( {\widetilde{\mathbf{z}}}_{p} = \mathbf{m} \odot  \mathbf{z}\; \) // Evaluate effect of preliminary mask \( \downarrow \)
	掩码特征以获得\( {\widetilde{\mathbf{z}}}_{p} = \mathbf{m} \odot  \mathbf{z}\; \) // 评估初步掩码\( \downarrow \)的效果
	Compute change \( \mathbf{c} \) according to Equation (2.12)
	根据公式(2.12)计算变化\( \mathbf{c} \)
	Revert masking for specific samples according to Equation (2.13)
	根据公式(2.13)对特定样本还原掩码
	Mask features \( \widetilde{\mathbf{z}} = \mathbf{m} \odot  \mathbf{z} \)
	掩码特征\( \widetilde{\mathbf{z}} = \mathbf{m} \odot  \mathbf{z} \)
	Compute loss \( \mathcal{L}\left( {w\left( \widetilde{\mathbf{z}}\right) ,\mathbf{y}}\right) \) and backpropagate to whole network
	计算损失\( \mathcal{L}\left( {w\left( \widetilde{\mathbf{z}}\right) ,\mathbf{y}}\right) \)并对整个网络进行反向传播
end
结束

end

结束


\( \widetilde{g}_{\mathbf{z},i,j} \) from Equation (2.9) on the gradients of the features from the previous convolutional layer, instead of the last one, and downsample it by a factor of 0.5 to match the size.

即依据公式(2.9)对前一卷积层(而非最后一层)的特征梯度计算\( \widetilde{g}_{\mathbf{z},i,j} \),并将其下采样0.5倍以匹配尺寸。

Our Results In an effort to provide somewhat fair results which are neither too optimistic nor too penalizing, we run the original RSC code five times for each of the testing environments and report the average performance in Table 2.3.

我们的结果 为了提供较为公平的结果,既不过于乐观也不过于苛刻,我们对每个测试环境运行原始RSC代码五次,并在表2.3中报告平均性能。

Run      | P     | A     | C     | S
1        | 93.23 | 81.69 | 78.11 | 81.14
2        | 93.41 | 79.44 | 77.38 | 80.55
3        | 94.37 | 80.08 | 76.58 | 79.18
4        | 93.71 | 81.49 | 78.84 | 81.90
5        | 93.95 | 79.39 | 76.75 | 81.19
Average  | 93.73 | 80.41 | 77.53 | 80.79
Reported | 95.99 | 83.43 | 80.31 | 80.85

运行     | P     | A     | C     | S
1        | 93.23 | 81.69 | 78.11 | 81.14
2        | 93.41 | 79.44 | 77.38 | 80.55
3        | 94.37 | 80.08 | 76.58 | 79.18
4        | 93.71 | 81.49 | 78.84 | 81.90
5        | 93.95 | 79.39 | 76.75 | 81.19
平均     | 93.73 | 80.41 | 77.53 | 80.79
报告的   | 95.99 | 83.43 | 80.31 | 80.85

Table 2.3: Reproduced results for Representation Self-Challenging using the official code base on the PACS dataset and with a ResNet-18 backbone.

表2.3:使用官方代码库在PACS数据集和ResNet-18骨干网络上复现Representation Self-Challenging的结果。

Given these observations, we follow other works such as Nuriel, Benaim, and Wolf [137] and report our reproduced results whenever comparing to RSC.

鉴于这些观察结果,我们遵循Nuriel、Benaim和Wolf [137]等其他工作,在与RSC比较时报告我们的复现结果。

Explainability in Deep Learning

深度学习中的可解释性

Machine Learning systems and especially deep neural networks are often seen as "black-boxes", i.e. they are hard to interpret, and pinpointing as a user how and why they arrive at their prediction is often very difficult, if not impossible. Neural networks commonly lack transparency for human understanding [180]. This property becomes a prominent impediment for intelligent systems deployed in impactful sectors like, for example, employment [24, 149, 216], jurisdiction [74], healthcare [140], or banking loans, where users would like to know the deciding factors behind decisions. As such, we would like systems that are easily interpretable, relatable to the user, provide contextual information about the choice, and reflect the intermediate thinking of the user for a decision [199]. Since these properties are very broad, it is not surprising that researchers in this field have taken very different approaches. For this chapter, we use the field guide by Xie et al. [199] to paint an appropriate overview and properly introduce the different approaches. Commonly, methods try to provide better solutions with respect to Confidence, Trust, Safety, and Ethics to improve the overall explainability of the model [199]:

机器学习系统,尤其是深度神经网络,通常被视为“黑箱”,即难以解释,用户很难甚至不可能明确其预测的原因和过程。神经网络通常缺乏人类可理解的透明性[180]。这一特性成为智能系统在诸如就业[24, 149, 216]、司法[74]、医疗[140]或银行贷款等关键领域应用的显著障碍,用户希望了解决策的决定因素。因此,我们希望系统易于解释、与用户相关联、提供关于选择的上下文信息,并反映用户的中间思考过程[199]。由于这些属性非常广泛,研究者在该领域采用了多种不同的方法。本章采用Xie等人[199]的领域指南,概述并恰当介绍不同的方法。通常,这些方法试图在置信度、信任、安全性和伦理方面提供更好的解决方案,以提升模型的整体可解释性[199]:

Confidence The confidence of a machine learning system is high when the "reasoning" behind a decision between the model and the user matches often. For example, saliency attention maps [87, 138] ensure that semantically relevant parts of an image get considered and therefore increase confidence.

置信度 当机器学习系统的“推理”与用户的决策理由高度一致时,置信度较高。例如,显著性注意力图[87, 138]确保图像中语义相关部分被考虑,从而提升置信度。

Trust Trust is established when the decision of an intelligent system doesn't need to be validated anymore. Recently, many works have studied the problem of whether a model can safely be adopted [63,93,188] . To be able to trust a model,we need to ensure satisfactory testing of the model and users need experience with it to ensure that the results commonly match the expectation [199].

信任 当智能系统的决策无需再被验证时,信任得以建立。近期许多研究关注模型是否可以安全采用[63,93,188]。为了建立信任,我们需要确保模型经过充分测试,且用户有使用经验以保证结果通常符合预期[199]。

Safety Safety needs to be high for machine learning systems that have an impact on people's lives in any form. As such, the model should perform consistently as expected, prevent choices that may hurt the user or society, have high reliability under all operating conditions, and provide feedback on how the operating conditions influence the behavior.

安全性 对于对人们生活有影响的机器学习系统,安全性必须高。模型应表现稳定可靠,防止可能伤害用户或社会的选择,在所有操作条件下具有高可靠性,并反馈操作条件如何影响行为。

Ethics The ethics are defined differently depending on the moral principles of each user. In general, though, one can create an "ethics code" on which a system's decisions are based [199]. Any sensitive characteristic, e.g. religion, gender, disability, or sexual orientation, is a feature that should be handled with great care. Similarly, we try to reduce the effect of any features that serve as a proxy for any type of discrimination, e.g. living in a specific part of a city, such as New York City's Chinatown, can be a proxy for the ethnic background or income. Since this chapter gives a high-level overview of recent advances in explainability for neural networks, also concerning domain generalization, readers may skip parts of it depending on their background.

伦理 伦理因用户的道德原则而异。一般而言,可以制定“伦理准则”作为系统决策的依据[199]。任何敏感特征,如宗教、性别、残疾或性取向,都应谨慎处理。同样,我们努力减少任何作为歧视代理的特征影响,例如居住在纽约市华埠等特定区域可能成为种族背景或收入的代理。鉴于本章提供了神经网络可解释性及领域泛化的高层次概述,读者可根据自身背景选择性阅读。

3.1 Related topics

3.1 相关主题

There exist several concepts which are related to explainable deep learning. Here, we explicitly cover model debugging, which tries to identify aspects that hinder training and inference, and fairness and bias, which especially tackles the ethics trait by searching for differences in regular and irregular activation patterns to promote robust and trustworthy systems [199].

存在若干与可解释深度学习相关的概念。这里我们特别涵盖模型调试,旨在识别阻碍训练和推理的因素,以及公平性和偏见,特别关注伦理特征,寻找常规与非常规激活模式的差异,以促进稳健可信的系统[199]。

3.1.1 Model Debugging

3.1.1 模型调试

Model debugging, similar to traditional software debugging, tries to pinpoint aspects of the architecture, data-processing, or training process which cause errors [199]. It aims at giving more insights into the model, allowing easier solving of faulty behavior. While such approaches help to open the black-box of neural network architectures, we handle them distinctly from the other literature here.

模型调试类似于传统软件调试,旨在定位架构、数据处理或训练过程中导致错误的因素[199]。其目标是深入了解模型,便于解决故障行为。虽然此类方法有助于揭开神经网络架构的黑箱,但我们在此将其与其他文献区分开来处理。

Amershi et al. [7] propose MODELTRACKER, which is a debugging framework and interactive visualization that displays traditional statistics like the Area Under the Curve (AUC) or confusion matrices. It also shows how close samples are in the feature space and allows users to expand the visualization to show the raw data or annotate them. Alain and Bengio [3] deploy linear classifiers to predict the information content in intermediate layers, where the features of every layer serve as input to a separate classifier. They show that using features from deeper layers improves prediction accuracy and that the level of linear separability increases monotonically. Fuchs et al. [59] introduce neural stethoscopes as a framework for analyzing factors of influence and interactively promoting and suppressing information. They extend the ordinary DNN architecture via a parallel two-layer perceptron at different locations, where the inputs are the features from any layer of the main architecture. This stethoscope is then trained on a supplemental task and the loss is back-propagated to the main model with weighting factor \( \lambda \) [59]. This factor controls whether the stethoscope functions analytically (\( \lambda = 0 \)), auxiliary (\( \lambda > 0 \)), or adversarial (\( \lambda < 0 \)) [59]. Further, Kang et al. [95] use model assertions, which are functions over a model's input and output that indicate when errors may be occurring. They show that with these they can solve problems where car detection in successive frames disappears and reappears [95]. Their model debugging is therefore implemented through a verification system [199].

Amershi 等人[7]提出了MODELTRACKER,这是一种调试框架和交互式可视化工具,展示了传统统计指标如曲线下面积(AUC)或混淆矩阵。它还显示样本在特征空间中的接近程度,并允许用户展开可视化以显示原始数据或对其进行注释。Alain 和 Bengio [3] 部署线性分类器来预测中间层的信息内容,其中每层的特征作为单独分类器的输入。他们表明,使用更深层的特征可以提高预测准确性,且线性可分性水平单调增加。Fuchs 等人[59]引入神经听诊器(neural stethoscopes)作为分析影响因素并交互式促进或抑制信息的框架。他们通过在不同位置添加一个并行的两层感知机扩展了普通深度神经网络(DNN)架构,输入为主架构任意层的特征。该听诊器随后在辅助任务上训练,损失通过加权因子λ[59]反向传播到主模型。该因子控制听诊器的功能是分析性(λ=0)、辅助性(λ>0)还是对抗性(λ<0)[59]。此外,Kang 等人[95]使用模型断言,即针对模型输入和输出的函数,用以指示可能发生错误的时刻。他们展示了利用这些断言可以解决连续帧中车辆检测消失和重新出现的问题[95]。因此,他们的模型调试通过验证系统实现[199]。

3.1.2 Fairness and Bias

3.1.2 公平性与偏见

To secure model fairness, there exist several definitions which have emerged in the literature in recent years. Group fairness [25], also known as demographic parity or statistical parity, aims at equalizing benefits across groups with respect to protected characteristics (e.g. religion, gender, etc.). By definition,if group A has twice as many members as group B ,twice as many people in group A should receive the benefit when compared to B [199]. On the other hand,individual fairness [47] tries to secure that similar feature inputs get treated similarly. There also exist other notions of fairness such as equal opportunity [75], disparate mistreatment [204], or other variations [78, 197].

为了保障模型公平性,近年来文献中出现了多种定义。群体公平(group fairness)[25],也称为人口统计平等或统计平等,旨在使不同群体在受保护特征(如宗教、性别等)方面获得的利益均等。按定义,如果群体A的成员数量是群体B的两倍,则群体A中获得利益的人数也应是群体B的两倍[199]。另一方面,个体公平(individual fairness)[47]试图确保相似的特征输入得到相似的对待。还有其他公平性概念,如机会均等(equal opportunity)[75]、差异性误用(disparate mistreatment)[204]或其他变体[78, 197]。

Methods that try to ensure fairness in machine learning systems can be classified into three approaches which operate during different steps called pre-processing, in-processing, post-processing:

旨在确保机器学习系统公平性的方法可分为三类,分别在预处理(pre-processing)、处理中(in-processing)和后处理(post-processing)阶段操作:

  1. Pre-processing methods adapt the input data beforehand to remove features correlated to protected characteristics. As such, they try to learn an alternative feature representation without relying on these types of attributes [70, 120, 144, 208].
  1. 预处理方法预先调整输入数据,去除与受保护特征相关的属性。通过这种方式,它们尝试学习一种不依赖于这些属性的替代特征表示[70, 120, 144, 208]。
  2. In-processing approaches add adjustments for fairness during the model learning process. This way, they penalize decisions that are not aligned with certain fairness constraints [1, 45, 48].
  2. 处理中方法在模型学习过程中加入公平性调整,从而惩罚不符合特定公平性约束的决策[1, 45, 48]。
  3. Post-processing methods adjust the model predictions after training to account for fairness. This is the reassignment of class labels after classification to minimize classification errors subject to a particular fairness constraint [53, 75, 146].
  3. 后处理方法在训练后调整模型预测,以考虑公平性。这是指在分类后重新分配类别标签,以在满足特定公平性约束的前提下最小化分类错误[53, 75, 146]。
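As a toy illustration of the group-fairness notion above, the demographic-parity gap between two groups can be measured directly from predictions; all data, names, and thresholds here are hypothetical.

```python
import numpy as np

# Toy predictions for two groups; the demographic-parity gap is the difference
# in positive-prediction rates between the groups (data is illustrative).
y_hat = np.array([1, 0, 1, 1, 0, 1, 0, 0])
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # protected attribute

rate_a = y_hat[group == 0].mean()   # positive rate in group A
rate_b = y_hat[group == 1].mean()   # positive rate in group B
parity_gap = abs(rate_a - rate_b)   # 0 would mean perfect group fairness
```

A post-processing method in the sense of item 3 would then reassign some labels until this gap falls below a chosen tolerance.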

3.2 Previous Works

3.2 相关工作

Generally, we can divide methods for explainable deep neural networks into visualization, model distillation, and intrinsic methods [199]. While visualization methods try to highlight features that strongly correlate with the output of the DNN, model distillation builds upon a jointly trained "white-box" model, following the input-output behavior of the original architecture and aiming to identify its decision rules [199]. Finally, intrinsic methods are networks designed to explain their own output, hence they aim to optimize both the performance and the respective explanations [199].

一般而言,可解释深度神经网络的方法可分为可视化、模型蒸馏和内在方法[199]。可视化方法试图突出与DNN输出高度相关的特征,模型蒸馏则基于联合训练的“白盒”模型,模仿原始架构的输入输出行为,旨在识别其决策规则[199]。最后,内在方法是设计用以解释其输出的网络,因此它们旨在同时优化性能和相应的解释[199]。

3.2.1 Visualization

3.2.1 可视化

Commonly, visualization methods use saliency maps to display the saliency values of the features, i.e. to which degree the features influence the model's prediction [199]. We can further divide visualization methods into back-propagation and perturbation-based approaches, which determine these values based on the magnitude of the gradient or on differences between modified versions of the input, respectively [199].

常见的可视化方法使用显著性图(saliency maps)展示特征的显著性值,即特征对模型预测的影响程度[199]。我们可以进一步将可视化方法分为基于反向传播和基于扰动的方法,分别根据梯度大小或输入的修改版本之间的差异确定这些值[199]。

Back-Propagation

反向传播

These approaches stick to the gradient passed through the network to determine the relevant features. As a simplistic baseline, one can display the partial derivative with respect to each input feature multiplied by its value [199]. This way, Simonyan, Vedaldi, and Zisserman [170] and Springenberg et al. [175] assess the sensitivity of the model for input changes [199]. This can also be done for collections of intermediate layers [11,129,168,206] .

这些方法沿用网络中传递的梯度来确定相关特征。作为简单基线,可以显示相对于每个输入特征的偏导数乘以其值[199]。通过这种方式,Simonyan、Vedaldi 和 Zisserman [170]以及 Springenberg 等人[175]评估了模型对输入变化的敏感性[199]。这也可以应用于中间层集合[11,129,168,206]

Zhou et al. [217] introduce class activation maps (CAMs), which are shown in Figure 3.1, based on global average pooling (GAP) [118]. With GAP, they deploy the following CNN structure at the end of the network: GAP(Convs) \( \rightarrow \) Fully Connected Layer (FC) \( \rightarrow \) softmax, where the CAM \( M_c \) for each class \( c \) is then calculated according to Equation (3.1). Here, \( K \) is the number of convolutional filters, \( \mathbf{z} \) are the activations of the last convolutional layer, and \( w_{k,c} \) are the weights from feature map \( k \) of the last convolutional layer to the logit for class \( c \) of the FC [217].

Zhou 等人[217]引入了基于全局平均池化(GAP)[118]的类激活映射(CAMs),如图3.1所示。利用GAP,他们在网络末端部署了以下结构:GAP(卷积层)\( \rightarrow \)全连接层(FC)\( \rightarrow \)softmax,其中每个类别\( c \)的CAM \( M_c \)根据公式(3.1)计算。这里,\( K \)是卷积滤波器的数量,\( \mathbf{z} \)是最后卷积层的激活值,\( w_{k,c} \)表示从最后卷积层的特征图\( k \)到全连接层中类别\( c \)的logit的权重[217]。

(3.1)\( M_c = \sum_{k=1}^{K} \mathbf{z}_k\, w_{k,c} \)

By upsampling the map to the image size, they can visualize the image regions responsible for a certain class [217]. Therefore, every class has its own class activation map. The drawback of their approach is that it can only be applied to networks that use the GAP(Convs) \( \rightarrow \) Fully Connected Layer (FC) \( \rightarrow \) softmax structure at the end [199].

通过将映射上采样到图像大小,他们可以可视化对某一类别负责的图像区域[217]。因此,每个类别都有其对应的类激活映射。他们方法的缺点是,该方法仅适用于在末端使用GAP(卷积层)\( \rightarrow \)全连接层(FC)\( \rightarrow \)softmax结构的网络[199]。
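A CAM per Equation (3.1) reduces to a weighted sum of feature maps. A minimal NumPy sketch with assumed shapes, not tied to any specific backbone:

```python
import numpy as np

H, W, K, C = 7, 7, 512, 10
z = np.random.rand(H, W, K)   # activations of the last conv layer
w = np.random.rand(K, C)      # FC weights from GAP features to class logits

# Class activation map for class c (Eq. 3.1): sum_k z_k * w_{k,c}.
c = 3
M_c = np.tensordot(z, w[:, c], axes=([2], [0]))   # shape (H, W)
```

Upsampling `M_c` to the input resolution then gives the per-class heatmap described above.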

Figure 3.1: Class activation maps across different architectures: [217]

图3.1:不同架构下的类激活映射:[217]

Selvaraju et al. [164] solve this impediment by generalizing CAMs to gradient-weighted class activation maps (Grad-CAMs). Since their approach only requires the final activation function to be differentiable, it is applicable to a broader range of CNN architectures [164, 199]. For that, they compute an importance score \( \widetilde{g}_{\mathbf{z},c}^{k} \) as:

Selvaraju 等人[164]通过将CAMs推广为梯度加权类激活映射(Grad-CAMs)解决了这一限制。由于他们的方法仅要求最终激活函数可微,因此适用于更广泛的卷积神经网络架构[164, 199]。为此,他们计算重要性分数\( \widetilde{g}_{\mathbf{z},c}^{k} \),定义如下:

(3.2)\( \widetilde{g}_{\mathbf{z},c}^{k} = \frac{1}{H_{\mathbf{z}} W_{\mathbf{z}}} \sum_{i=1}^{H_{\mathbf{z}}} \sum_{j=1}^{W_{\mathbf{z}}} \frac{\partial y^c}{\partial z_{i,j}^{k}}. \)

Here, \( y^c \) is the score for class \( c \) before the softmax, and we calculate the gradient with respect to the feature map \( \mathbf{z}^k \) in the final convolutional layer for every neuron positioned at \( (i,j) \) in the \( H_{\mathbf{z}} \times W_{\mathbf{z}} \) feature map [164, 199]. Afterward, these importance scores get linearly combined over the feature maps as shown in Equation (3.3), where the result additionally gets passed through a ReLU function:

这里,\( y^c \)是softmax之前类别\( c \)的得分,我们针对最终卷积层的特征图\( \mathbf{z}^k \),计算其在\( H_{\mathbf{z}} \times W_{\mathbf{z}} \)特征图中每个位置\( (i,j) \)处神经元的梯度[164, 199]。随后,这些重要性分数按公式(3.3)对各特征图线性组合,并经过ReLU函数处理:

(3.3)\( M_c = \max\left(0, \sum_{k=1}^{K} \widetilde{g}_{\mathbf{z},c}^{k}\, \mathbf{z}^{k}\right). \)

This computation inherently yields an \( H_{\mathbf{z}} \times W_{\mathbf{z}} \) importance map (\( 14 \times 14 \) for VGG [171] and AlexNet [100], \( 7 \times 7 \) for ResNet [77]), which we upsample using bilinear interpolation to the image size to yield the Grad-CAM.

该计算本质上产生了一个\( H_{\mathbf{z}} \times W_{\mathbf{z}} \)重要性图(VGG [171] 和 AlexNet [100] 对应\( 14 \times 14 \),ResNet [77] 对应\( 7 \times 7 \)),我们通过双线性插值将其上采样到图像大小,得到Grad-CAM。
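Equations (3.2) and (3.3) can be sketched in a few lines of NumPy; the gradients here are random stand-ins for the backpropagated \( \partial y^c / \partial z_{i,j}^k \), so this only illustrates the shapes and the combination step:

```python
import numpy as np

H, W, K = 14, 14, 512
z = np.random.rand(H, W, K)        # final conv feature maps
dy_dz = np.random.randn(H, W, K)   # stand-in for gradient of class score y_c w.r.t. z

# Importance scores (Eq. 3.2): global average pooling of the gradients.
alpha = dy_dz.mean(axis=(0, 1))    # shape (K,)

# Grad-CAM (Eq. 3.3): ReLU over the weighted combination of feature maps.
M_c = np.maximum(0.0, np.tensordot(z, alpha, axes=([2], [0])))  # shape (H, W)
```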

Apart from CAMs, there also exist other methods like layer-wise relevance propagation [11, 41, 103, 129], deep learning important features (DeepLIFT) [168], or integrated gradients [181] which are not described in detail here. Please refer to Xie et al. [199] or the original works for more information.

除了CAMs,还有其他方法如层级相关传播(layer-wise relevance propagation)[11, 41, 103, 129]、深度学习重要特征(DeepLIFT)[168]或积分梯度(integrated gradients)[181],此处不作详细描述。更多信息请参见Xie 等人[199]或相关原始文献。

Perturbation

干扰法

Perturbation methods alter the input features and compute their respective relevance for the model's output by comparing the differences between the original and the perturbed version.

干扰法通过改变输入特征,比较原始与扰动版本的差异,计算各特征对模型输出的相关性。

Zeiler and Fergus [206] sweep a gray patch over the image to determine how the model reacts to occluded areas. Once an area with a high correlation to the output is covered, the prediction performance drops [199, 206]. Li, Monroe, and Jurafsky [113] deploy a similar idea for NLP tasks where they erase words and measure the influence on the model's performance. Fong and Vedaldi [57] define three perturbations: i) replacing patches with a constant value, ii) adding noise to a region, and iii) blurring the area [57, 199]. Zintgraf et al. [220] propose a method based on Robnik-Sikonja and Kononenko [156] where they calculate the relevance of a feature for class \( c \) through the prediction difference between including the respective feature or occluding it [220]. For that, they simulate the absence of each feature. A positive value for the computed difference means the feature influences the model's decision towards class \( c \), while a negative value means the feature influences the prediction against class \( c \) [220]. Zintgraf et al. [220] extend the initial method by Robnik-Sikonja and Kononenko [156] by removing patches instead of pixels and adapting the method for intermediate layers [199].

Zeiler 和 Fergus [206]通过在图像上滑动灰色遮挡块,观察模型对遮挡区域的反应。一旦遮挡区域与输出高度相关,预测性能便下降[199, 206]。Li、Monroe 和 Jurafsky [113]在自然语言处理任务中采用类似思路,删除词语并测量对模型性能的影响。Fong 和 Vedaldi [57]定义了三种干扰方式:i)用常数值替换图像块,ii)向区域添加噪声,iii)模糊该区域[57, 199]。Zintgraf 等人[220]基于Robnik-Sikonja 和 Kononenko [156]的方法,计算特征对类别c的相关性,方法是比较包含该特征与遮挡该特征时的预测差异[220]。为此,他们模拟每个特征的缺失。计算差异为正值表示该特征对类别c的决策有正向影响,负值则表示对类别c的预测有负向影响[220]。Zintgraf 等人[220]通过移除图像块而非像素,并将方法扩展至中间层,改进了Robnik-Sikonja 和 Kononenko [156]的初始方法[199]。
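The occlusion idea of Zeiler and Fergus [206] can be sketched as a patch sweep; the `predict` function, patch size, and fill value here are illustrative assumptions, not their implementation:

```python
import numpy as np

def occlusion_map(image, predict, patch=8, fill=0.5):
    """Slide a gray patch over the image and record the prediction drop.

    `predict` maps an image to the probability of the correct class; the
    drop relative to the unoccluded prediction is the saliency value.
    """
    H, W = image.shape[:2]
    base = predict(image)
    sal = np.zeros((H // patch, W // patch))
    for i in range(0, H - patch + 1, patch):
        for j in range(0, W - patch + 1, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = fill
            sal[i // patch, j // patch] = base - predict(occluded)
    return sal

# Toy "model": the probability is just the mean of the top-left quadrant,
# so only occlusions inside that quadrant should register a drop.
img = np.ones((32, 32))
sal = occlusion_map(img, lambda x: x[:16, :16].mean())
```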

3.2.2 Model distillation

3.2.2 模型蒸馏

Model distillation methods allow for post-training explanations where we learn a distilled model which imitates the original model's decisions on the same data [199]. It has access to information from the initial model and can therefore give insights about the features and output correlations [199]. Generally, we can divide these methods into local approximation and model translation approaches. These either replicate the model behavior on a small subset of the input data, based on the idea that the mechanisms a network uses to discriminate in a local area of the data manifold are simpler (local approximation), or stick to using the entire dataset with a smaller model (model translation) [199].

模型蒸馏方法允许在训练后进行解释,我们通过学习一个蒸馏模型来模仿原始模型在相同数据上的决策[199]。该模型可以访问初始模型的信息,因此能够提供关于特征和输出相关性的见解[199]。通常,我们可以将这些方法分为局部近似和模型转换两类。这些方法要么基于这样一个理念:网络在数据流形的局部区域内进行判别的机制更简单,从而在输入数据的小子集上复制模型行为(局部近似);要么使用整个数据集配合更小的模型进行学习(模型转换)[199]。

Local approximations

局部近似

Even though it may seem unintuitive to pursue approaches that don't explain every decision made by the DNN, practitioners often want to interpret decisions made for a specific data subset, e.g. employee performance indicators for those fired with poor performance [199]. One of the most popular local approximations is the method proposed by Ribeiro, Singh, and Guestrin [153] called local interpretable model-agnostic explanations (LIME). They propose a notation where, from an unexplainable global model \( f_\theta \) and an original representation of an instance \( x_i \), we want an interpretable model \( g_\theta \) from the class of potentially interpretable models \( G \). Since not all models \( g_\theta \in G \) have the same degree of interpretability, they define a complexity measure \( \Pi(g_\theta) \), which could be the depth of the tree for decision trees or the number of non-zero weights for linear models [153]. They incorporate this complexity measure, together with a measure of how faithfully \( g_\theta \) approximates \( f_\theta \) in a certain locality, in their loss term. There also exist many other works which build upon LIME to solve certain drawbacks [50, 154]. We don't go into details here as it is only partially related to this thesis.

尽管追求不解释深度神经网络(DNN)所有决策的做法看似不合直觉,实践者往往希望解释特定数据子集的决策,例如针对因绩效不佳被解雇员工的绩效指标[199]。最流行的局部近似方法之一是Ribeiro、Singh和Guestrin提出的局部可解释模型无关解释(LIME)[153]。他们提出一种表述:从一个不可解释的全局模型\( f_\theta \)和一个实例的原始表示\( x_i \)出发,我们希望从潜在可解释模型类\( G \)中得到一个可解释模型\( g_\theta \)。由于并非所有模型\( g_\theta \in G \)的可解释程度相同,他们定义了一个复杂度度量\( \Pi(g_\theta) \),该度量可以是决策树的深度或线性模型中非零权重的数量[153]。他们将该复杂度度量与\( g_\theta \)在某一局部区域对\( f_\theta \)的逼近程度一起纳入损失项中。还有许多基于LIME的后续工作旨在解决其某些缺陷[50, 154]。这里不做详细介绍,因为这与本论文仅部分相关。
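The local-surrogate idea behind LIME can be sketched as a weighted least-squares fit around one instance; the black-box \( f \), kernel width, and sample count are illustrative choices, not the original implementation:

```python
import numpy as np

# Fit a weighted linear surrogate g to mimic a black-box f around x0.
rng = np.random.default_rng(0)
f = lambda x: np.sin(x[:, 0]) + x[:, 1] ** 2      # toy "unexplainable" model
x0 = np.array([0.5, 1.0])                         # instance to explain

X = x0 + 0.3 * rng.standard_normal((200, 2))      # perturbed neighborhood
w = np.exp(-np.sum((X - x0) ** 2, axis=1) / 0.25) # proximity weights

# Weighted least squares for g(x) = a + b . x (closed form).
A = np.hstack([np.ones((200, 1)), X])
W = np.diag(w)
coef = np.linalg.solve(A.T @ W @ A, A.T @ W @ f(X))
```

The fitted slopes `coef[1:]` approximate the local sensitivities of `f` around `x0`, which is exactly the kind of explanation a linear surrogate provides.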

Model Translation

模型转换

The idea of model translation is to mimic the behavior of the original deep neural network on the whole dataset, contrary to local approximations which only use a smaller subset. Some works have tried to distill neural networks into decision trees [58, 182, 215], finite state automata [82], graphs [213-215], or causal- and rule-based models [76, 134]. Generally, the distilled models can be easier to deploy, faster to converge, or simply more explainable [199].

模型转换的思想是模仿原始深度神经网络在整个数据集上的行为,这与仅使用较小子集的局部近似方法相反。一些工作尝试将神经网络蒸馏为决策树[58,182,215]、有限状态自动机[82]、图结构[213-215],或因果和基于规则的模型[76, 134]。一般来说,蒸馏模型可能更易部署、收敛更快,或更具可解释性[199]。

3.2.3 Intrinsic methods

3.2.3 内在方法

Finally, intrinsic methods jointly output an explanation in combination with their prediction. In an ideal world, such methods would be on par with state-of-the-art models without explainability. This approach introduces an additional task that gets jointly trained with the original task of the model [199]. The additional task usually tries to provide either text explanations [26, 79, 81, 207], an explanation association [6,44,90,106] ,or prototypes [32,114] which differ in the provided explainability type as well as the degree of insight.

最后,内在方法在输出预测的同时联合输出解释。在理想情况下,此类方法的性能应与无解释性的最先进模型相当。这种方法引入了一个与模型原始任务联合训练的附加任务[199]。该附加任务通常尝试提供文本解释[26, 79, 81, 207]、解释关联[6,44,90,106]或原型[32,114],它们在提供的可解释类型及洞察深度上有所不同。

Attention mechanism

注意力机制

The attention mechanism [189] takes motivation from the human visual focus and peripheral perception [162]. With that, humans can focus on certain regions to achieve high resolution while adjacent objects are perceived with a rather low resolution [162]. In the attention mechanism, we learn a conditional distribution over given inputs using weighted contextual alignment scores (attention weights) [199]. These allow for insights on how strongly different input features are considered during model inference [199]. The alignment scores can be computed differently, for example, content-based [71], additive [13], based on the matrix dot-product [121], or as a scaled version of the matrix dot-product [189]. Especially due to the transformer architecture [189], attention has been shown to improve neural network performance, originally in natural language processing [23, 39, 102] but more recently also in image classification and other computer vision tasks [8, 205]. It has also been shown that attention is the update rule of a modern Hopfield network with continuous states [152], an architecture that hasn't been used much in modern neural network models. There has also been a discussion on whether attention counts as explanation and to which degree the process offers insights into the inner workings of a neural network [91, 165, 196].

注意力机制[189]的灵感来源于人类的视觉焦点和周边感知[162]。通过这种机制,人类能够聚焦于特定区域以实现高分辨率的感知,而相邻物体则以较低分辨率被感知[162]。在注意力机制中,我们通过加权的上下文对齐分数(注意力权重)[199]学习给定输入的条件分布。这些权重揭示了在模型推理过程中不同输入特征被考虑的强度[199]。对齐分数的计算方式多样,例如基于内容[71]、加法[13]、矩阵点积[121]或矩阵点积的缩放版本[189]。尤其是由于Transformer架构[189]的出现,注意力机制最初在自然语言处理领域[23, 39, 102]显著提升了神经网络性能,近年来也被应用于图像分类及其他计算机视觉任务[8,205]。此外,研究表明注意力机制是具有连续状态的现代Hopfield网络[152]的更新规则,这种架构在现代神经网络模型中尚未被广泛应用。关于注意力是否可作为解释手段以及其在揭示神经网络内部工作机制中的作用程度,也存在一定的讨论[91,165,196]
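A minimal sketch of the scaled dot-product attention of Vaswani et al. [189], with assumed toy dimensions; the resulting rows of attention weights are the alignment scores discussed above:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # alignment scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V, weights

Q = np.random.randn(4, 8)   # 4 queries of dimension d_k = 8
K = np.random.randn(6, 8)   # 6 keys
V = np.random.randn(6, 16)  # 6 values of dimension 16
out, attn = scaled_dot_product_attention(Q, K, V)
```

Each row of `attn` sums to 1 and can be inspected directly, which is why attention weights are often read as a (contested) form of explanation.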

Text explanations

文本解释

Text explanations are natural language outputs that explain the model decision using a form like "This image is of class A because of B". As such, they are quite easy to understand regardless of the user's background. Works that take this approach are, for example, Hendricks et al. [79] or Park et al. [138]. Drawbacks of textual explanations are that they i) require supervision for explanations during training and ii) have been shown to be inconsistent, which questions the validity of these types of explanations [27].

文本解释是以自然语言形式输出的解释,用于说明模型决策,通常采用“该图像属于A类别,因为B”的表达方式。因此,无论用户背景如何,这类解释都较易理解。采用此方法的工作例如Hendricks等人[79]或Park等人[138]。文本解释的缺点在于:i) 训练过程中需要对解释进行监督;ii) 解释结果存在不一致性,质疑了这类解释的有效性[27]。

Explanation association

解释关联

Latent features or input elements that are combined with human-understandable concepts are classified under explanation associations. Such explanations either combine input or latent features with semantic concepts, associate the model prediction with a set of input elements, or utilize object saliency maps to visualize relevant image parts [199].

将潜在特征或输入元素与人类可理解的概念结合的解释方法归类为解释关联。这类解释要么将输入或潜在特征与语义概念结合,要么将模型预测与一组输入元素关联,或利用对象显著性图来可视化相关图像部分[199]。

Prototypes

原型

Finally, model prototype approaches are specifically designed for classification tasks [18, 97, 147, 198]. The term prototype in few- and zero-shot learning settings refers to a point in the feature space representing a single class [114]. In such methods, the distance to the prototype determines how an observation is classified. The prototypes are not limited to a single observation but can also be obtained using a combination of observations or latent representations [199]. A criticism, on the other hand, is that some data instances are not well represented by the set of prototypes [128]. To obtain explainability using prototypes, one can trace the reasoning path for the prediction back to the learned prototypes [199]. Li et al. [114] use a prototype layer to deploy an explainable image classifier. They propose an architecture with an autoencoder and a prototype classifier. This prototype classifier calculates the \( \ell_2 \) distance between the encoded input and each of the prototypes, passes this through a fully connected layer to compute weighted sums of these distances, and finally normalizes them through a softmax layer [114]. Since these prototypes live in the same space as the encoded inputs, they can be visualized with a jointly trained decoder [114]. This property, coupled with the fully connected weights, allows for explainability through visualization of the prototypes and their respective influence on the prediction. Figure 3.2 shows the visualizations for the generic number prototypes obtained by Li et al. [114] on MNIST [105] and the car angle prototypes on the Car [55] dataset.

最后,模型原型方法专门针对分类任务设计[18, 97, 147, 198]。在少样本和零样本学习设置中,原型(prototype)指的是特征空间中代表单一类别的点[114]。在此类方法中,观测样本与原型的距离决定其分类结果。原型不仅限于单个观测样本,也可通过多个观测样本或潜在表示的组合获得[199]。批评意见则指出存在某些数据实例无法被原型集合良好代表[128]。通过原型实现可解释性,可以追溯预测的推理路径至学习到的原型[199]。Li等人[114]使用原型层部署了可解释的图像分类器。他们提出了包含自编码器和原型分类器的架构。该原型分类器计算编码输入与各原型之间的\( \ell_2 \)距离,通过全连接层计算这些距离的加权和,最后通过softmax层进行归一化[114]。由于这些原型与编码输入处于同一空间,可通过联合训练的解码器进行可视化[114]。这一特性结合全连接权重,使得通过可视化原型及其对预测的影响实现解释成为可能。图3.2展示了Li等人[114]在MNIST[105]和Car[55]数据集上获得的通用数字原型和汽车角度原型的可视化。

Figure 3.2: Prototypes for the MNIST (left) and Car (right) dataset: [114]

图3.2:MNIST(左)和Car(右)数据集的原型:[114]

Chen et al. [32] introduce a prototypical part network (ProtoPNet) which has similar components to Li et al. [114], namely a convolutional neural network projecting onto a latent space and a prototype classifier. The approach chosen by Chen et al. [32] is different as the prototypes are more fine-grained and represent parts of the input image [199]. Hence, their model associates image patches with prototypes for explanations [199]. Figure 3.3 illustrates this approach for bird species classification.

Chen等人[32]提出了原型部件网络(ProtoPNet),其组件与Li等人[114]类似,即卷积神经网络映射到潜在空间和原型分类器。Chen等人[32]的方法不同之处在于原型更为细粒度,代表输入图像的部分[199]。因此,他们的模型将图像块与原型关联以实现解释[199]。图3.3展示了该方法在鸟类物种分类中的应用。

Figure 3.3: Image of a clay colored sparrow and its decomposition into prototypes: [32]

图3.3:泥色麻雀图像及其分解为原型:[32]

As a general framework, we would like to learn \( m \) prototypes \( \mathbf{P} = \{\mathbf{p}_j\}_{j=1}^{m} \) with \( \mathbf{p}_j \in \mathbb{R}^{H_p \times W_p \times K} \), which each resemble a prototypical activation pattern in a patch of the convolutional output [32]. Each prototype unit \( g_{\mathbf{p}_j} \) of the prototype layer \( g_{\mathbf{p}} \) computes some distance metric (e.g. the \( \ell_2 \) norm) between the \( j \)-th prototype \( \mathbf{p}_j \) and all patches of \( \mathbf{z} \) with the same shape as \( \mathbf{p}_j \) and inverts that into a similarity score using some mapping function [32]. This computation yields a similarity map \( \Psi_j \in \mathbb{R}^{H_{\mathbf{z}} \times W_{\mathbf{z}}} \) which shows how representative the \( j \)-th prototype is for each latent patch, and this can be upsampled to the initial image size for an overlay heatmap [32]. When max-pooling this similarity map, we obtain a similarity score that measures how strongly the \( j \)-th prototype is represented by any latent patch.

作为一个通用框架,我们希望学习\( m \)个原型\( \mathbf{P} = \{\mathbf{p}_j\}_{j=1}^{m} \),其中\( \mathbf{p}_j \in \mathbb{R}^{H_p \times W_p \times K} \),每个原型都类似于卷积输出某个区域的典型激活模式[32]。原型层\( g_{\mathbf{p}} \)的每个原型单元\( g_{\mathbf{p}_j} \)计算第\( j \)个原型\( \mathbf{p}_j \)与\( \mathbf{z} \)中所有与\( \mathbf{p}_j \)形状相同的区域之间的某种距离度量(例如\( \ell_2 \)范数),并通过某种映射函数将其转换为相似度分数[32]。该计算产生一个相似度图\( \Psi_j \in \mathbb{R}^{H_{\mathbf{z}} \times W_{\mathbf{z}}} \),显示第\( j \)个原型对每个潜在区域的代表性,该图可以上采样到初始图像大小以生成叠加热力图[32]。对该相似度图进行最大池化后,我们得到一个相似度分数,用以衡量第\( j \)个原型在任意潜在区域中的表现强度。

Chen et al. [32] compute the maximum similarity score for each prototype unit \( g_{\mathbf{p}_j} \) by:

Chen等人[32]通过以下方式计算每个原型单元\( g_{\mathbf{p}_j} \)的最大相似度分数:

(3.4)\( g_{\mathbf{p}_j}(\mathbf{z}) = \max_{\widetilde{\mathbf{z}} \in \operatorname{patches}(\mathbf{z})} \log\left( \frac{\|\widetilde{\mathbf{z}} - \mathbf{p}_j\|_2^2 + 1}{\|\widetilde{\mathbf{z}} - \mathbf{p}_j\|_2^2 + \epsilon} \right), \)

where the squared \(\ell_2\) distance is used and \(\epsilon\) is a small numerical stability factor. Their modified logarithm function satisfies the property of a similarity mapping function since, with an increasing \(\ell_2\)-norm, the function returns a smaller value, i.e. larger distance values correspond to smaller similarities. Keep in mind that this needs to be appropriately adjusted when using any other distance metric. For example, both the cosine and dot-product measures have increasing similarity values for increasing distance values. In theory, any Bregman divergence [16] is applicable as a distance metric. However, Snell, Swersky, and Zemel [173] have observed that this choice can be very impactful and that the \(\ell_2\)-norm performs better than the cosine distance on few-shot tasks.

其中使用了平方2距离,ϵ是一个小的数值稳定因子。他们修改后的对数函数满足相似度映射函数的性质,因为随着2范数的增大,函数返回的值减小,即距离越大对应的相似度越小。请注意,使用其他距离度量时需要相应调整。例如,余弦相似度和点积度量的相似度值会随着距离值的增大而增大。理论上,任何Bregman散度[16]都可作为距离度量。然而,Snell、Swersky和Zemel[173]观察到这一选择影响显著,且对于少样本任务,2范数的表现优于余弦距离。
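To make the mapping concrete, here is a minimal NumPy sketch of this similarity computation for a single 1×1 prototype, assuming the latent volume is given as an \((H_z, W_z, K)\) array; the function name and the choice \(\epsilon = 10^{-4}\) are ours, not taken from the reference implementation.

```python
import numpy as np

def prototype_similarity(z, prototype, eps=1e-4):
    """Similarity map and max-pooled score for one 1x1 prototype (Eq. 3.4).

    z         : (Hz, Wz, K) convolutional output
    prototype : (K,) prototype vector
    """
    # squared L2 distance between every latent patch and the prototype
    d2 = np.sum((z - prototype) ** 2, axis=-1)      # shape (Hz, Wz)
    # monotonically decreasing log mapping: small distance -> large similarity
    sim = np.log((d2 + 1.0) / (d2 + eps))           # similarity map Psi_j
    return sim.max(), sim                           # max-pooled score, full map
```

Upsampling `sim` to the input resolution yields the overlay heatmap described above.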

To enforce that every class \(c\) will be represented by at least one prototype, there is a pre-determined number of prototypes for each class, denoted as \(P_c\) with \(P_c \subseteq P\). During training, Chen et al. [32] minimize the objective:

为了确保每个类别c至少由一个原型表示,每个类别预设了固定数量的原型,记为Pc,其中PcP。在训练过程中,Chen等人[32]最小化目标函数:

\[ \min_{P, \theta_\phi} \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}_{ce}\big(\underbrace{w \circ g_p \circ \phi(x_i)}_{\text{Prediction } \hat{y}_i},\, y_i\big) + \lambda_1 \mathcal{L}_{clst} + \lambda_2 \mathcal{L}_{sep}, \tag{3.5} \]

where the cluster loss Lclst  and separation loss Lsep  are defined as:

其中聚类损失Lclst 和分离损失Lsep 定义为:

\[ \mathcal{L}_{clst} = \frac{1}{n} \sum_{i=1}^{n} \min_{j : p_j \in P_{y_i}} \min_{\tilde{\mathbf{z}} \in \operatorname{patches}(\mathbf{z})} \lVert \tilde{\mathbf{z}} - p_j \rVert_2^2 \tag{3.6} \]

\[ \mathcal{L}_{sep} = -\frac{1}{n} \sum_{i=1}^{n} \min_{j : p_j \notin P_{y_i}} \min_{\tilde{\mathbf{z}} \in \operatorname{patches}(\mathbf{z})} \lVert \tilde{\mathbf{z}} - p_j \rVert_2^2. \tag{3.7} \]
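For a single sample, the cluster and separation losses of Equations (3.6) and (3.7) can be sketched as below; patches are assumed to be flattened to vectors, and all names are ours.

```python
import numpy as np

def clst_sep_losses(patches, prototypes, proto_class, label):
    """Cluster / separation losses (Eqs. 3.6-3.7) for ONE sample.

    patches     : (P, K) latent patches of the sample
    prototypes  : (m, K) prototype vectors
    proto_class : (m,)   class id each prototype belongs to
    label       : ground-truth class id of the sample
    """
    # (m, P) squared distances between every prototype and every patch
    d2 = ((prototypes[:, None, :] - patches[None, :, :]) ** 2).sum(-1)
    own = proto_class == label
    l_clst = d2[own].min()     # pull some own-class prototype close to a patch
    l_sep = -d2[~own].min()    # push other-class prototypes away from all patches
    return l_clst, l_sep
```

Averaging these quantities over the batch recovers the two loss terms of Equation (3.5).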

Note that this is only the first part of a multi-step training procedure: Equation (3.5) solely optimizes the parameters of the featurizer \(\theta_\phi\) and the prototypes \(P\), but not the classifier, as its weights \(\theta_w\) are frozen with an initialization of \(w_{c,j} = 1\) for each connection between the \(j\)-th prototype unit \(g_{p_j}\) and the logit for class \(c\) when \(j : p_j \in P_c\), while for \(j : p_j \notin P_c\) it is set to \(w_{c,j} = -0.5\). The positive connection for the similarity to a prototype of that specific class increases the prediction value for class \(c\), while the negative connection for the similarity to a prototype of a different class decreases it. This initialization, together with the separation loss, guides the prototypes to represent semantic concepts for a class but also ensures that the same semantic concept is not learned by the other classes. Later on, Chen et al. [32] optimize the classifier parameters \(\theta_w\) for sparsity while fixing all other parameters, to reduce the effect of negative network reasoning on classification.

注意,这只是一个多步骤训练过程的第一部分,其中方程(3.5)仅优化特征提取器(featurizer)θϕ和原型(prototypes)P的参数,而不优化分类器,因为其权重θw被冻结,初始化为每个连接wc,j,连接的是第j个原型单元gpj与类别c的logitj:pjPc,初始化值为wc,j=1,而j:pjPc被设为wc,j=0.5。对该特定类别原型的相似性正连接会增加类别c的预测值,而对不同类别原型的相似性负连接则会降低该值。该初始化与分离损失(separation loss)共同引导原型表示类别的语义概念,同时确保其他类别不会学习相同的语义概念。随后,Chen等人[32]在固定其他参数的情况下,优化分类器参数θw以实现稀疏性,从而减少负面网络推理对分类的影响。
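The frozen last-layer initialization described above is a simple weight matrix; a sketch with a hypothetical helper name:

```python
import numpy as np

def init_last_layer(proto_class, num_classes):
    """Frozen (m, C) classifier weights: +1 for the connection of a
    prototype to its own class logit, -0.5 for every other class."""
    m = len(proto_class)
    W = np.full((m, num_classes), -0.5)
    W[np.arange(m), proto_class] = 1.0
    return W
```

Multiplying the vector of max-pooled similarity scores with this matrix yields the class logits.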

While Li et al. [114] need a decoder for visualizing the prototypes, Chen et al. [32] do not require this component since they visualize the closest latent image patch across the full training dataset instead of directly visualizing the prototype [32]. They also show that, when combining several of their networks into a larger network, their method is on par with the best-performing deep models [32].

Li等人[114]需要解码器来可视化原型,而Chen等人[32]则不需要该组件,因为他们通过在整个训练数据集中可视化最接近的潜在图像块来代替直接可视化原型[32]。他们还展示了将多个网络组合成更大网络时,该方法的性能与表现最佳的深度模型相当[32]。

3.3 Explainability for Domain Generalization

3.3 域泛化的可解释性

To the best of our knowledge, the only work using explanations for domain generalization is by Zunino et al. [221]. They introduce a saliency-based approach utilizing a 2D binary map of pixel locations for the ground-truth object segmentation as input. This map contains a 1 in every pixel location where the class-label object is present and a 0 otherwise. Even though they were able to show that their method focuses better on relevant regions, we identify the additional annotations of class-label objects as a major drawback, which we solve in this work. Indeed, we not only avoid the use of annotations by directly resorting to Grad-CAMs (see Section 4.1), but our idea is also different in spirit. In fact, we do not force the network to focus on relevant regions as in [221]; instead, we want the network to use i) different explanations for the same objects (to achieve better generalization) and ii) explanations that are consistent across domains (to avoid overfitting a single input distribution).

据我们所知,唯一使用解释方法进行域泛化的工作是Zunino等人[221]提出的。他们引入了一种基于显著性的方式,利用二维二值像素位置图作为输入,该图对应真实物体分割。该图中,物体类别标签所在像素位置为1,其余为0。尽管他们证明了该方法更好地聚焦于相关区域,但我们认为对类别标签物体的额外标注是一个主要缺点,而我们在本工作中解决了这一问题。实际上,我们不仅避免了使用标注,直接采用Grad-CAM(见第4.1节),而且我们的思路也有所不同。我们并不强制网络聚焦于相关区域(如[221]所做),而是希望网络能够i) 对相同物体使用不同的解释(以实现更好的泛化),ii) 解释在不同域间保持一致(以避免对单一输入分布的过拟合)。

Proposed Methods

提出的方法

In order to apply some of the previously mentioned topics from the explainability literature to the domain generalization task, we specifically investigate the usage of gradient class activation maps from Section 3.2.1, as well as prototypes from Section 3.2.3. Our methods, which are based upon these approaches, are described in Section 4.1 (DIvCAM), Section 4.2 (ProDROP), and Section 4.2.3 (D-TRANSFORMERS), respectively.

为了将前述可解释性文献中的部分内容应用于域泛化任务,我们特别研究了第3.2.1节中的梯度类激活图(gradient class activation maps)以及第3.2.3节中的原型。基于这些方法,我们分别在第4.1节(DIVCAM)、第4.2节(ProDROP)和第4.2.3节(D-TRANSFORMERS)中描述了相应的方法。

4.1 Diversified Class Activation Maps (DivCAM)

4.1 多样化类激活图(Diversified Class Activation Maps,DivCAM)

In Section 2.6 we introduce the concept of Representation Self-Challenging for domain generalization, while in Section 3.2.1 class activation maps, and specifically Grad-CAM, are introduced. It is quite easy to see that the importance scores \(\tilde{g}_{\mathbf{z},c}^{k}\) in Grad-CAM from Equation (3.2) are a generalization of the spatial mean \(\tilde{g}_{\mathbf{z}}\) used in Channel-Wise RSC from Equation (2.10). The spatial mean \(\tilde{g}_{\mathbf{z}}\) only computes the gradient with respect to the features for the most probable class, while the importance scores \(\tilde{g}_{\mathbf{z},c}^{k}\) are formulated theoretically for all possible classes but similarly compute the gradient with respect to the feature representation. Both perform spatial average pooling.

在第2.6节中,我们介绍了用于域泛化的表示自我挑战(Representation Self-Challenging)概念,而在第3.2.1节中引入了类激活图,特别是Grad-CAM。很容易看出,Grad-CAM中方程(3.2)的权重分数g~z,ck是通道维度自我挑战(Channel-Wise RSC)中方程(2.10)所用空间均值g~z的推广。空间均值g~z仅计算最可能类别对应特征的梯度,而权重分数g~z,ck理论上针对所有可能类别计算梯度,但同样是针对特征表示。两者均执行空间平均池化。

Despite the effectiveness of the model, we believe the approach of Huang et al. [86] does not fully exploit the relation between a feature vector and the actual content of the image. We argue (and experimentally demonstrate) that we can directly use CAMs to construct the self-challenging task. In particular, while the raw target gradient represents the significance of each channel in each spatial location for the prediction, CAMs allow us to better capture the actual importance of each image region. Thus, performing the targeted masking on the highest CAM values means explicitly excluding the most relevant regions of the image that were used for the prediction, forcing the model to focus on other (and interpretable) visual cues for recognizing the object of interest.

尽管该模型效果显著,我们认为Huang等人[86]的方法并未充分利用特征向量与图像实际内容之间的关系。我们主张(并通过实验验证)可以直接使用类激活映射(CAMs)来构建自我挑战任务。具体而言,虽然原始目标梯度表示每个通道在每个空间位置对预测的重要性,CAMs则能更好地捕捉图像各区域的实际重要性。因此,在最高CAM值处进行有针对性的遮罩,意味着明确排除用于预测的图像最相关区域,迫使模型关注其他(且可解释的)视觉线索以识别目标对象。

Therefore, as an intuitive baseline, we propose Diversified Class Activation Maps (DIvCAM), combining the two approaches as shown in Algorithm 2 and visualized on a high level in Figure 4.1. For that, during each step of the training procedure, we extract the features, compute the gradients with respect to the features as in Equation (2.8), and perform spatial average pooling to yield \(\tilde{g}_{\mathbf{z}}\) according to Equation (2.10). Our method deviates from Channel-Wise RSC by next computing class activation maps \(M_c \in \mathbb{R}^{H_z \times W_z \times 1}\) according to Equation (4.1) for the ground-truth class label.

因此,作为一个直观的基线,我们提出了多样化类激活映射(Diversified Class Activation Maps,DIvCAM),结合了两种方法,如算法2所示,或在图4.1中高层次可视化。为此,在训练过程的每一步,我们提取特征,按照公式(2.8)计算相对于特征的梯度,并执行空间平均池化以根据公式(2.10)得到g~z。我们的方法区别于通道级随机遮罩(Channel-Wise RSC),接下来根据公式(4.1)为真实类别标签计算类激活映射McRHz×Wz×1

\[ M_c = \max\left(0, \sum_{k=1}^{K} \tilde{g}_{\mathbf{z},c}^{k}\, \mathbf{z}^{k}\right) \tag{4.1} \]

Figure 4.1: Visualization of the DIVCAM training process

图4.1:DIVCAM训练过程的可视化

Based on these maps, and similar to Equation (2.11), we compute a mask \(\mathbf{m} \in \mathbb{R}^{H_z \times W_z \times 1}\) for the Top-\(p\) percentile of map activations as:

基于这些映射,类似于公式(2.11),我们计算一个掩码mRHz×Wz×1,用于映射激活值的前p百分位,如下所示:

\[ m_{c,i,j} = \begin{cases} 0, & \text{if } M_{c,i,j} \geq q_p \\ 1, & \text{otherwise} \end{cases} \tag{4.2} \]

As class activation maps and the corresponding masks are averaged along the channel dimension to be specific for each spatial location, we duplicate the mask along all channels to yield a mask \(\mathbf{m} \in \mathbb{R}^{H_z \times W_z \times K}\) with the same size as the features, which can directly be multiplied with the features to mask them and to regularize the training procedure:

由于类激活映射及对应掩码沿通道维度平均以针对每个空间位置,我们将掩码沿所有通道复制,得到与特征mRHz×Wz×K大小相同的掩码,可直接与特征相乘以遮罩它们并正则化训练过程:

\[ \tilde{\mathbf{z}} = \mathbf{m} \odot \mathbf{z}, \tag{4.3} \]

where \(\odot\) is the Hadamard product. The new feature vector \(\tilde{\mathbf{z}}\) is used as input to the classifier \(w\) in place of the original \(\mathbf{z}\) to regularize the training procedure. For the masked features, we compute the cross-entropy loss from Equation (2.3) and backpropagate the gradient of the loss through the whole network to update the parameters.

其中为Hadamard乘积。新的特征向量z~作为分类器w的输入,替代原始的z,以正则化训练过程。对于被遮罩的特征,我们根据公式(2.3)计算交叉熵损失,并将损失梯度反向传播至整个网络以更新参数。

Intuitively, constantly applying this masking for all samples within each batch disregards important features and results in relatively poor performance, as the network is not able to learn discriminative features in the first place. Therefore, applying the mask only to certain samples within each batch, as mentioned by Huang et al. [86, Section 3.3], should yield better performance. For convenience, we call this process mask batching. On top of that, one could schedule the mask batching with an increasing factor (e.g. a linear schedule) such that masking gets applied more in the later training epochs, where discriminative features have already been learned. We apply the mask only if the sample \(n\) is within the \((100-b)\)-th percentile of confidences for the correct class (stored in the change vector \(\mathbf{c}\) for each sample \(n\)) and reset the mask otherwise by setting each spatial location \((i,j)\) back to 1:

直观上,对每个批次中的所有样本持续应用此遮罩会忽略重要特征,导致性能较差,因为网络无法首先学习判别特征。因此,如Huang等人[86,第3.3节]所述,仅对每批次中的部分样本应用遮罩应能获得更好性能。为方便起见,我们称此过程为掩码批处理。此外,可以采用递增因子(如线性调度)安排掩码批处理,使得在后期训练阶段判别特征已被学习时,遮罩应用得更多。我们仅当样本n的正确类别置信度位于(100 - b)百分位内(置信度存储于每个样本n的变化向量c中)时应用遮罩,否则通过将每个空间位置(i,j)重置为1来取消遮罩:

\[ m_{c,i,j}^{n} = \begin{cases} m_{c,i,j}^{n}, & \text{if } c_n \geq q_b \\ 1, & \text{otherwise} \end{cases} \tag{4.4} \]

where \(c_n\) is the confidence on the ground truth for sample \(n\). This procedure enforces that masks only get applied to samples which are already classified well enough, such that the network can now focus on other discriminative properties. For our full ablation study on applying the masks within each batch, please see Section 5.4.2. Figure 4.2 also shows some of the class activation maps produced by DIvCAM throughout the training procedure.

其中cn为样本n对真实类别的置信度。此过程确保遮罩仅应用于已被较好分类的样本,使网络能够专注于其他判别特性。关于在每批次中应用遮罩的完整消融研究,请参见第5.4.2节。图4.2还展示了DIVCAM在训练过程中产生的一些类激活映射。
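Putting Equations (4.1) to (4.4) together, a single DivCAM masking step for one sample can be sketched in NumPy as follows, assuming the feature gradient has already been computed by the backward pass; all names are ours, and the actual implementation operates on batches of samples.

```python
import numpy as np

def divcam_mask(z, grad_z, p=30.0, confidence=None, q_b=None):
    """One DivCAM masking step (Eqs. 4.1-4.4) for a single sample.

    z       : (Hz, Wz, K) features
    grad_z  : (Hz, Wz, K) gradient of the target logit w.r.t. z
    p       : percentile of CAM activations to drop
    confidence, q_b : optional mask-batching gate (Eq. 4.4)
    """
    # Eq. (2.10): spatial average pooling of the gradient per channel
    g_tilde = grad_z.mean(axis=(0, 1))                  # (K,)
    # Eq. (4.1): gradient-weighted CAM for the ground-truth class
    cam = np.maximum(0.0, (g_tilde * z).sum(axis=-1))   # (Hz, Wz)
    # Eq. (4.2): zero out the top-p percentile of CAM values
    q_p = np.percentile(cam, 100.0 - p)
    mask = np.where(cam >= q_p, 0.0, 1.0)
    # Eq. (4.4): reset the mask for samples the network is not confident on
    if confidence is not None and q_b is not None and confidence < q_b:
        mask = np.ones_like(mask)
    # Eq. (4.3): broadcast the mask over channels, Hadamard product
    return z * mask[:, :, None]
```

The masked features are then fed to the classifier and the cross-entropy loss is backpropagated as in Algorithm 2.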

Algorithm 2: Diversified Class Activation Maps (DIVCAM)

算法2:多样化类激活映射(DIVCAM)


Input: Data \( \mathbf{X},\mathbf{Y} \) with \( {x}_{i} \in {\mathbb{R}}^{H \times W \times 3} \), drop factors \( p,b \), epochs \( T \)

输入:数据X,Y,样本数xiRH×W×3,丢弃因子p,b,训练轮数T

while epoch \( \leq T \) do

当训练轮数为T

for every batch \( \mathbf{x},\mathbf{y} \) do
对于每个批次 \( \mathbf{x},\mathbf{y} \) 执行
	Extract features \( \mathbf{z} = \phi \left( \mathbf{x}\right) \) \( //\mathbf{z} \) has shape \( {\mathbb{R}}^{{H}_{\mathbf{z}} \times  {W}_{\mathbf{z}} \times  K} \)
	提取特征 \( \mathbf{z} = \phi \left( \mathbf{x}\right) \) \( //\mathbf{z} \) 的形状为 \( {\mathbb{R}}^{{H}_{\mathbf{z}} \times  {W}_{\mathbf{z}} \times  K} \)
	Compute \( {\mathbf{g}}_{\mathbf{z},c} \) with Equation (2.8)
	使用公式 (2.8) 计算 \( {\mathbf{g}}_{\mathbf{z},c} \)
	Compute \( {\widetilde{\mathbf{g}}}_{\mathbf{z},c}^{k} \) with Equation (2.10) // \( {\widetilde{\mathbf{g}}}_{\mathbf{z}} \) has shape \( {\mathbb{R}}^{1 \times  1 \times  K} \)
	使用公式 (2.10) 计算 \( {\widetilde{\mathbf{g}}}_{\mathbf{z},c}^{k} \) // \( {\widetilde{\mathbf{g}}}_{\mathbf{z}} \) 的形状为 \( {\mathbb{R}}^{1 \times  1 \times  K} \)
	Compute \( {\mathbf{M}}_{c} \) with Equation (4.1) // \( \mathrm{M} \) has shape \( {\mathbb{R}}^{{H}_{\mathbf{z}} \times  {W}_{\mathbf{z}} \times  1} \)
	使用公式 (4.1) 计算 \( {\mathbf{M}}_{c} \) // \( \mathrm{M} \) 的形状为 \( {\mathbb{R}}^{{H}_{\mathbf{z}} \times  {W}_{\mathbf{z}} \times  1} \)
	Compute \( {\mathbf{m}}_{c,i,j} \) with Equation (4.2)
	使用公式 (4.2) 计算 \( {\mathbf{m}}_{c,i,j} \)
	Repeat mask along channels // Afterwards \( \mathrm{m} \) has shape \( {\mathbb{R}}^{{H}_{\mathbf{z}} \times  {W}_{\mathbf{z}} \times  K} \)
	沿通道重复掩码 // 之后 \( \mathrm{m} \) 的形状为 \( {\mathbb{R}}^{{H}_{\mathbf{z}} \times  {W}_{\mathbf{z}} \times  K} \)
	Adapt \( {\mathbf{m}}_{c,i,j} \) with Equation (4.4)
	使用公式 (4.4) 调整 \( {\mathbf{m}}_{c,i,j} \)
	Compute \( \widetilde{\mathbf{z}} \) with Equation (4.3)
	使用公式 (4.3) 计算 \( \widetilde{\mathbf{z}} \)
	Backpropagate loss \( {\mathcal{L}}_{ce}\left( {w\left( \widetilde{\mathbf{z}}\right) ,\mathbf{y}}\right) \)
	反向传播损失 \( {\mathcal{L}}_{ce}\left( {w\left( \widetilde{\mathbf{z}}\right) ,\mathbf{y}}\right) \)
end
结束

end

结束


To try and improve the effectiveness of our CAM-based regularization approach, we can borrow some practices from the weakly-supervised object localization literature. In particular, we explore the use of Homogeneous Negative CAMs (HNC) [180] and Threshold Average Pooling (TAP) [12]. Both methods improve the performance of ordinary CAMs and focus them better on the relevant aspects of an image. See Section 5.4.3 for an evaluation of these variants.

为了尝试提升基于CAM(类激活映射)正则化方法的有效性,我们可以借鉴弱监督目标定位领域的一些做法。特别地,我们探讨了同质负类激活映射(Homogeneous Negative CAMs,HNC)[180]和阈值平均池化(Threshold Average Pooling,TAP)[12]的使用。这两种方法均提升了普通CAM的性能,并更好地聚焦于图像的相关部分。相关变体的评估见第5.4.3节。

4.1.1 Global Average Pooling bias for small activation areas

4.1.1 小激活区域的全局平均池化偏差

According to Bae, Noh, and Kim [12], one problem of traditional class activation maps is that the activated areas of each feature map differ between the respective channels because these capture different class information, which is not properly reflected in the global average pooling operation. Since every channel is globally averaged, smaller feature activation areas result in smaller globally averaged values despite a similar maximum activation value. This does not necessarily mean that one of the features is more relevant for the prediction, but can simply be caused by a large area with small activations. To combat this problem, the weight \(w_{k,c}\) corresponding to the smaller value is often trained to be higher when comparing two channels [12]. Instead of the global average pooling operation, they propose Threshold Average Pooling (TAP). When adapting their approach to our notation, we obtain Equation (4.5), where \(\tau_{tap} = \lambda_{tap} \max(\tilde{\mathbf{z}}^{k})\) with \(\lambda_{tap} \in [0,1)\) as a hyperparameter, and \(p_{tap}^{k}\) denotes the scalar from the \(k\)-th channel of \(p_{tap}\), as it is a \(K\)-dimensional vector.

根据Bae、Noh和Kim [12]的研究,传统类激活映射存在的问题之一是每个特征图的激活区域因通道不同而异,因为不同通道捕获了不同的类别信息,而这些信息未能在全局平均池化操作中得到恰当反映。由于每个通道都被全局平均,较小的特征激活区域会导致较小的全局平均值,尽管最大激活值相似。这并不一定意味着某个特征对预测更为重要,而可能仅仅是由大面积小激活引起。为解决此问题,通常在比较两个通道时,权重 wk,c 对应较小值的通道会被训练得更高[12]。他们提出用阈值平均池化(TAP)替代全局平均池化。将其方法适配到我们的符号中,得到公式 (4.5),其中 τtap =λtap max(z~k) 是超参数,取值为 λtap [0,1)ptap k 表示 ptap 中第 k 通道的标量,因为它是一个 k 维向量。

\[ p_{tap}^{k} = \frac{\sum_{i=1}^{H_z} \sum_{j=1}^{W_z} \mathbb{1}\big(\tilde{z}_{i,j}^{k} > \tau_{tap}\big)\, \tilde{z}_{i,j}^{k}}{\sum_{i=1}^{H_z} \sum_{j=1}^{W_z} \mathbb{1}\big(\tilde{z}_{i,j}^{k} > \tau_{tap}\big)} \tag{4.5} \]

When incorporating this into DIvCAM, this results in changing the global average pooling after self-challenging has been applied to a threshold average pooling. Generally, this plug-in replacement can be seen as a trade-off between global max pooling which is better at identifying the important activations of each channel and global average pooling which has the advantage that it expands the activation to broader regions, allowing the loss to backpropagate.

将此方法整合进DIvCAM后,意味着在自我挑战(self-challenging)应用后,将全局平均池化替换为阈值平均池化。总体来看,这种替换可视为全局最大池化和全局平均池化之间的权衡,前者更擅长识别每个通道的重要激活,后者则有利于激活扩展到更广区域,从而使损失能够反向传播。

4.1.2 Smoothing negative Class Activation Maps

4.1.2 平滑负类激活图

Based on the analysis of Sun et al. [180], negative class activation maps, i.e. the class activation maps for classes other than the ground truth, often have false activations even when those classes are not present in an image. To solve this localization error, they propose a loss function which adds a weighted homogeneous negative CAM (HNC) loss term to the existing cross-entropy loss. This is shown in Equation (4.6), where \(\lambda_1\) controls the weight of the additional loss term.

基于Sun等人[180]的分析,负类激活图,即除真实类别外其他类别的激活图,常常在图像中不存在该类别时仍出现错误激活。为解决这一定位误差,他们提出了一种损失函数,在现有的交叉熵损失基础上增加了加权的均匀负类(HNC)损失项。如公式(4.6)所示,其中λ1控制该附加损失项的权重。

\[ \mathcal{L}_{neg} = \mathcal{L}_{ce}(\mathbf{y}, w(\tilde{\mathbf{z}})) + \lambda_1 \mathcal{L}_{hnc}(\mathbf{y}, \mathbf{M}) \tag{4.6} \]

Figure 4.2: Class activation maps used by DivCAM-S at update steps 300/5000, 2700/5000, and 4500/5000 with a ResNet-50 backbone. For the giraffe, the network initially focuses on the neck, while our masks force it to also take the overall shape into consideration, finally settling on the torso. For the elephant, the network initially focuses mainly on the trunk and is later guided towards taking the shape into consideration as well.

图4.2:在更新步骤300/5000、2700/5000及4500/5000中,使用ResNet-50骨干网络的DivCAM-S的类激活图。对于长颈鹿,我们最初关注其脖子,而我们的掩码促使网络也考虑整体形状,最终聚焦于躯干。对于大象,我们最初主要关注象鼻,随后引导网络也考虑整体形状。

Sun et al. [180] propose two approaches for implementing \(\mathcal{L}_{hnc}\) in their work, both operating on the Top-\(m\) most confident negative classes. The first one is based on the mean squared error, which suppresses peak responses in the CAMs, while the second one utilizes the Kullback-Leibler (KL) divergence, trying to minimize the difference between negative CAMs and a uniform probability map. Since they report similar performance for these variants and the KL loss applies a comparably smoother penalty, we use the KL divergence for our method:

Sun等人[180]在其工作中提出了两种实现Lhnc 的方法,均作用于Top-m置信度最高的负类。第一种基于均方误差,抑制CAM中的峰值响应;第二种利用Kullback-Leibler (KL)散度,试图最小化负类CAM与均匀概率图之间的差异。由于他们报告这两种变体性能相近,且KL损失施加的惩罚较为平滑,我们的方法采用KL散度:

\[ \mathcal{L}_{hnc}(\mathbf{y}, \mathbf{M}) = \sum_{c \in J_{>m}} D_{KL}\big(U \,\Vert\, M_c'\big). \tag{4.7} \]

Here, \(J_{>m}\) is the set of Top-\(m\) negative classes with the highest confidence scores, \(U \in \mathbb{R}^{H_z \times W_z}\) is a uniform probability matrix with all elements having the value \((H_z W_z)^{-1}\), and \(M_c' = \sigma(M_c)\) is a probability map produced by applying the softmax function \(\sigma\) to each negative class activation map \(M_c\). Plugging in the definition of the KL divergence and removing the constant as in Equation (4.8) finally results in the simplified version given in Equation (4.9).

这里,J>m是置信度最高的Top-m负类集合,URHz×Wz是所有元素值均为(HzWz)1的均匀概率矩阵,Mc=σ(Mc)是通过对每个负类激活图Mc应用softmax函数σ得到的概率图。将KL散度定义代入并去除常数项,如方程(4.8)所示,最终得到简化版本方程(4.9)。

\[ D_{KL}\big(U \,\Vert\, M_c'\big) = \sum_{i=1}^{H_z} \sum_{j=1}^{W_z} U_{i,j} \log\left(\frac{U_{i,j}}{M_{c,i,j}'}\right) = \text{const} - \frac{1}{H_z W_z} \sum_{i=1}^{H_z} \sum_{j=1}^{W_z} \log\big(M_{c,i,j}'\big) \tag{4.8} \]

Generally, with this approach we add two hyperparameters in the form of the weighting parameter \(\lambda_1\) and the cut-off number \(m\) for the Top-\(m\) negative classes.

通常,采用此方法时,我们引入两个超参数:权重参数λ和Top-k负类的截断数k

\[ \mathcal{L}_{hnc}(\mathbf{y}, \mathbf{M}) = -\frac{1}{H_z W_z} \sum_{c \in J_{>m}} \sum_{i=1}^{H_z} \sum_{j=1}^{W_z} \log\big(M_{c,i,j}'\big) \tag{4.9} \]
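Dropping constants, Equation (4.9) can be computed directly from the raw negative CAMs as in the following sketch; the names are ours, and the softmax is taken over the spatial locations of each map.

```python
import numpy as np

def hnc_loss(neg_cams):
    """Homogeneous-negative-CAM loss (Eq. 4.9): KL(U || softmax(M_c)),
    summed over the selected negative classes, constants dropped.

    neg_cams : (m, Hz, Wz) raw negative class activation maps."""
    m, Hz, Wz = neg_cams.shape
    flat = neg_cams.reshape(m, -1)
    # numerically stable spatial softmax -> probability map M'_c per class
    e = np.exp(flat - flat.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)
    return -np.log(probs).sum() / (Hz * Wz)
```

The loss is minimized when every negative CAM is spatially uniform, i.e. no location stands out, and grows as peak responses appear.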

Since we use Grad-CAMs instead of ordinary CAMs in DIvCAM, naïvely applying this would require computing the gradient for every negative class \(c\) in the set \(J_{>m}\), which would result in computing

由于我们在DivCAM中使用Grad-CAM而非普通CAM,直接应用此方法需要对集合J>m中的每个负类c计算梯度,这将导致计算量巨大。

Equation (4.10), where \(y^{c}\) is the confidence of the negative class.

方程(4.10)中,yc表示负类的置信度。

\[ \mathcal{L}_{hnc}(\mathbf{y}, \mathbf{M}) = -\frac{1}{H_z W_z} \sum_{c \in J_{>m}} \sum_{i=1}^{H_z} \sum_{j=1}^{W_z} \log\left(\sigma\left(\max\left(0, \sum_{k=1}^{K} \left(\frac{1}{H_z W_z} \sum_{i=1}^{H_z} \sum_{j=1}^{W_z} \frac{\partial y^{c}}{\partial z_{i,j}^{k}}\right) \mathbf{z}^{k}\right)\right)\right) \tag{4.10} \]

To speed up training for tasks with a large number of classes, we approximate the loss by summing the negative class confidences before backpropagating, as shown in Equation (4.11). This amounts to considering all negative classes within \(J_{>m}\) as one negative class.

为加速具有大量类别任务的训练,我们通过在反向传播前对负类置信度求和来近似损失,如方程(4.11)所示。这相当于将集合J>内的所有负类视为一个负类。

\[ \hat{\mathcal{L}}_{hnc}(\mathbf{y}, \mathbf{M}) = -\frac{1}{H_z W_z} \sum_{i=1}^{H_z} \sum_{j=1}^{W_z} \log\left(\sigma\left(\max\left(0, \sum_{k=1}^{K} \left(\frac{1}{H_z W_z} \sum_{i=1}^{H_z} \sum_{j=1}^{W_z} \frac{\partial \sum_{c \in J_{>m}} y^{c}}{\partial z_{i,j}^{k}}\right) \mathbf{z}^{k}\right)\right)\right) \tag{4.11} \]

To finally implement this into DIvCAM, we simply substitute the current loss \(\mathcal{L}_{ce}(\mathbf{y}, w(\tilde{\mathbf{z}}))\) in line 12 of Algorithm 2 with Equation (4.6), where \(\mathcal{L}_{hnc}(\mathbf{y}, \mathbf{M})\) is implemented through our approximation \(\hat{\mathcal{L}}_{hnc}(\mathbf{y}, \mathbf{M})\) given in Equation (4.11).

最终,为将此方法实现于DIvCAM,我们仅需在算法2第12行将当前损失Lce(y,w(z~))替换为方程(4.6),其中通过方程(4.11)给出的近似L^hnc (y,M)实现了Lhnc (y,M)

Next, we can try to utilize domain information in DIvCAM by aligning the distributions of class activation maps produced by the same class across domains. We want their distributions to align as closely as possible, such that we cannot identify which domain produced which class activation map. For that, we can utilize some methods previously introduced in Section 2.3.1; in particular, we explore minimizing the sample maximum mean discrepancy introduced in Equation (2.7) and using a conditional domain adversarial neural network (CDANN). See Section 5.4.3 for an evaluation of these variants.

接下来,我们可以尝试在DIVCAM中利用领域信息,通过对跨领域同一类别产生的类激活图分布进行对齐。我们希望它们的分布尽可能一致,以致无法识别哪个领域生成了哪张类激活图。为此,我们可以利用第2.3.1节中介绍的一些方法,特别是探索最小化方程(2.7)中提出的样本最大均值差异(MMD)以及使用条件领域对抗神经网络(CDANN)。相关变体的评估见第5.4.3节。

4.1.3 Conditional Domain Adversarial Neural Networks

4.1.3 条件领域对抗神经网络

We combine the conditional domain adversarial neural network (CDANN) approach, originally introduced by Li et al. [116], with DivCAM to align the distributions of CAMs across domains. For that, we try to predict the domain to which a class activation map belongs by passing it to a multi-layer perceptron \(\omega\). We compute the cross-entropy loss between the predictions and the domain ground truth \(\mathbf{d}\) and weight it for each sample by the occurrence probability of the respective class. After weighting, we sum up all the losses and add them to our overall loss, weighted by \(\lambda_2\):

我们结合了Li等人[116]最初提出的领域对抗神经网络(CDANN)方法与DivCAM,以对齐不同领域间的类激活图(CAMs)分布。为此,我们尝试通过将类激活图输入多层感知机ω来预测其所属领域。我们计算预测结果与领域真实标签之间的交叉熵损失d,并根据各样本对应类别的出现概率对其加权。加权后,我们将所有损失求和,并乘以权重λ2后加入整体损失中:

\[ \mathcal{L}_{adv} = \mathcal{L}_{ce}(\mathbf{y}, w(\tilde{\mathbf{z}})) + \lambda_2 \Big( \mathcal{L}_{ce}(\mathbf{d}, \omega(\mathbf{M})) + \eta\, \big\lVert \nabla_{\mathbf{M}} \mathcal{L}_{ce}(\mathbf{d}, \omega(\mathbf{M})) \big\rVert^2 \Big). \tag{4.12} \]

During each training step, we either update the discriminator, i.e. the predictor for the domain, or the generator, i.e. the main network including featurizer and classifier. The discriminator loss inherently includes an \(\ell_2\) penalty on the gradients, weighted by \(\eta\).

在每次训练步骤中,我们要么更新判别器,即领域预测器,要么更新生成器,即包含特征提取器和分类器的主网络。判别器损失本质上包含一个对梯度的2惩罚项,权重为η

4.1.4 Maximum Mean Discrepancy

4.1.4 最大均值差异

Given two samples \(x^{\xi_1}\) and \(x^{\xi_2}\) drawn from two individual, unknown domain distributions \(\mathcal{D}_{\xi_1}\) and \(\mathcal{D}_{\xi_2}\), the maximum mean discrepancy (MMD) is given by Equation (4.13), where \(\varphi : \mathbb{R}^d \to \mathcal{H}\) is a feature map and \(k(\cdot,\cdot)\) is the kernel function induced by \(\varphi(\cdot)\). We consider every distinct pair of source domains \((\xi_u, \xi_v)\), representing training domains \(\xi_u\) and \(\xi_v\) with \(\xi_u \neq \xi_v\), to be in the set \(P\).

给定两个样本xξ1xξ2,分别从两个未知的独立领域分布Dξ1Dξ2中抽取,最大均值差异(MMD)由公式(4.13)给出,其中φ:RdH是特征映射,k(,)是由φ()诱导的核函数。我们考虑所有不同的源领域对(ξu,ξv),代表训练领域ξuξv,其中ξuξv属于集合P

\[ \mathcal{L}_{dist} = \sum_{\xi_u, \xi_v \in P} \Big\lVert\, \mathbb{E}_{x^{\xi_u} \sim \mathcal{D}_{\xi_u}}\big[\varphi(\phi(x^{\xi_u}))\big] - \mathbb{E}_{x^{\xi_v} \sim \mathcal{D}_{\xi_v}}\big[\varphi(\phi(x^{\xi_v}))\big] \,\Big\rVert_{\mathcal{H}} \tag{4.13} \]

Figure 4.3: Domain-agnostic Prototype Network

图4.3:领域无关原型网络

In simpler terms, we map features into a reproducing kernel Hilbert space \(\mathcal{H}\) and compute their mean differences within the RKHS. This loss pushes samples from different domains which represent the same class to lie nearby in the embedding space. According to Sriperumbudur et al. [176], this mean embedding is injective, i.e. arbitrary distributions are uniquely represented in the RKHS, if we use a characteristic kernel. For this work, we choose the Gaussian kernel shown in Equation (4.14), which is a well-known characteristic kernel.

简单来说,我们将特征映射到再生核希尔伯特空间(RKHS)H,并计算它们在RKHS中的均值差异。该损失促使来自不同领域但属于同一类别的样本在嵌入空间中靠近。根据Sriperumbudur等人[176]的研究,如果使用特征核(characteristic kernel),该均值嵌入是单射的,即任意分布在RKHS中有唯一表示。本文选用公式(4.14)中展示的高斯核,这是一种著名的特征核。

\[ k(x, x') = \exp\left(-\frac{\lVert x - x' \rVert^2}{2\sigma^2}\right) \tag{4.14} \]

Since the choice of kernel function can have a significant impact on the distance metric, we adopt the approach of Li et al. [112] and use a mixture kernel by averaging over multiple choices of \(\sigma\), as already implemented in DOMAINBED. This gets incorporated into our loss function, weighted by \(\lambda_3\):

由于核函数的选择对距离度量有显著影响,我们采用Li等人[112]的方法,通过对多种σ取平均形成混合核,正如DOMAINBED中已实现的那样。该方法被纳入我们的损失函数,并以权重λ3加权:

\[ \mathcal{L}_{mmd} = \mathcal{L}_{ce}(\mathbf{y}, w(\tilde{\mathbf{z}})) + \lambda_3 \mathcal{L}_{dist}. \tag{4.15} \]
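A sample-based sketch of Equations (4.13) and (4.14) with a small mixture of bandwidths; the bandwidth values and function names here are placeholders, not the DOMAINBED defaults, and we compute the squared MMD of a single domain pair.

```python
import numpy as np

def gaussian_kernel(x, y, sigmas=(1.0, 2.0, 4.0)):
    """Mixture of Gaussian kernels (Eq. 4.14), averaged over bandwidths."""
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # pairwise sq. dist.
    return np.mean([np.exp(-d2 / (2.0 * s ** 2)) for s in sigmas], axis=0)

def mmd2(x, y, sigmas=(1.0, 2.0, 4.0)):
    """Squared sample MMD between feature batches of two domains (Eq. 4.13)."""
    return (gaussian_kernel(x, x, sigmas).mean()
            + gaussian_kernel(y, y, sigmas).mean()
            - 2.0 * gaussian_kernel(x, y, sigmas).mean())
```

Summing `mmd2` over all distinct source-domain pairs and adding the result to the cross-entropy loss, weighted by \(\lambda_3\), recovers Equation (4.15).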

With this approach, we inherently align the computed masks by aligning the individual samples from different domains, aiming at producing domain invariant masks. This procedure can be applied at different levels e.g. on the features, class activation maps, or masked class activation maps. In Section 5.4.3, we only provide results for the feature level due to the effectiveness that similar approaches showed in DOMAINBED. However, we observe a similar trend for the other application levels as well.

通过该方法,我们通过对不同领域的单个样本进行对齐,内在地实现了计算掩码的对齐,旨在生成领域不变的掩码。该过程可应用于不同层级,例如特征层、类激活图层或掩码类激活图层。在第5.4.3节中,我们仅提供特征层的结果,因类似方法在DOMAINBED中表现出较高效能。然而,我们也观察到其他应用层级呈现类似趋势。

4.2 Prototype Networks for Domain Generalization

4.2 用于领域泛化的原型网络

Another approach to combine explainability methods with the task of domain generalization is to use the prototype method outlined in Section 3.2.3. In particular, we can directly adapt the approach of Chen et al. [32] as a baseline where we associate each class with a pre-defined number of prototypes. The cluster and separation losses from Equation (3.6) ensure that each prototype resembles a prototypical attribute for the associated class and we minimize them according to Equation (3.5).

将可解释性方法与领域泛化任务结合的另一种方法是使用第3.2.3节中概述的原型方法。具体而言,我们可以直接采用Chen等人[32]的方法作为基线,将每个类别关联一定数量的预定义原型。公式(3.6)中的聚类和分离损失确保每个原型都代表该类别的典型属性,并根据公式(3.5)对其进行最小化。

For our application scenario, this prototype layer is used after the domain-agnostic featurizer \(\phi\) and operates on the features as illustrated in Figure 4.3. As each prototype is trained with data from all training domains \(\Xi\), they become inherently domain-agnostic. This baseline uses a joint classifier \(w\) to output the final prediction, which operates on the maximum similarity between each prototype and some latent patch. Similar to Chen et al. [32], we preface the prototype layer with two convolutional layers with kernel size 1, a ReLU function between them, and finally a sigmoid activation function. We observe that having roughly 100 initial update steps in which only these in-between layers are trained is crucial for competitive performance. We anticipate that these steps are used to adapt the randomly-initialized convolutional weights to the image statistics imposed by the pre-trained backbone. 1 While this baseline is meaningful, we also consider a second variant where we build an ensemble of prototype layers, each learning domain-specific prototypes.

对于我们的应用场景,该原型层在领域无关特征提取器ϕ之后使用,并在如图4.3所示的特征上操作。由于每个原型都是用所有训练领域的数据Ξ训练的,因此它们本质上是领域无关的。该基线使用联合分类器w输出最终预测,该分类器基于所有原型和某些潜在补丁的最大相似度进行操作。类似于Chen等人[32]的做法,我们在原型层前置了两个卷积层,卷积核大小为1,中间夹有ReLU激活函数,最后是sigmoid激活函数。我们观察到,初始约100步仅训练这两个中间层的更新步骤对于获得竞争性能至关重要。我们推测这些步骤用于将随机初始化的卷积权重适应预训练骨干网络施加的图像统计特性。1虽然该基线具有意义,我们还考虑了第二种变体,即构建原型层的集成,每个原型层学习特定领域的原型。
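Since a 1×1 convolution is just a per-location matrix multiplication, these adapter layers can be sketched as follows; the weight shapes and names are illustrative, not the actual configuration.

```python
import numpy as np

def adapter(z, W1, b1, W2, b2):
    """Map backbone features into the prototype space via two 1x1
    convolutions with a ReLU in between and a sigmoid on top.

    z : (Hz, Wz, K) features; W1 : (K, D1); W2 : (D1, D2)."""
    h = np.maximum(0.0, z @ W1 + b1)             # 1x1 conv + ReLU
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))  # 1x1 conv + sigmoid, in (0, 1)
```

The sigmoid bounds the latent patches to \((0,1)\), which keeps the squared distances to the prototypes in a well-behaved range during the early update steps mentioned above.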

Figure 4.4: Ensemble Prototype Network

图4.4:集成原型网络

4.2.1 Ensemble Prototype Network

4.2.1 集成原型网络

Following the intuition provided by works that utilize model ensembling, which have been described in Section 2.3.2, we can use domain information by up-scaling the network to use a prototype layer for each domain separately. For the PACS dataset, this would correspond to having three prototype layers, one for each training domain e.g. a photo, art, and cartoon prototype layer when predicting sketch images. Each prototype layer is only trained with images from their corresponding domain.

基于第2.3.2节中描述的利用模型集成的直觉,我们可以通过扩展网络为每个领域单独使用一个原型层来利用领域信息。对于PACS数据集,这相当于拥有三个原型层,每个训练领域一个,例如在预测素描图像时分别有照片、艺术和卡通原型层。每个原型层仅用其对应领域的图像进行训练。

As shown in Figure 4.4, we associate each domain with both a prototype layer and a classifier which takes similarity scores of that domain's prototypes as input. During training, we only feed images of the associated domain to the respective prototype layer and classifier to enforce this domain correspondence. The aggregation weights of the final linear layer are set to a one-hot encoding representing the correct domain. During testing, we can then feed the new unseen domain to each domain prototype layer, allowing each domain's prototypes to influence the final prediction.

如图4.4所示,我们为每个领域关联一个原型层和一个分类器,分类器以该领域原型的相似度分数作为输入。训练时,我们仅将对应领域的图像输入相应的原型层和分类器,以强制实现领域对应关系。最终线性层的聚合权重设置为表示正确领域的独热编码。测试时,我们可以将新的未见领域输入每个领域的原型层,使每个领域的原型都能影响最终预测。

There exist multiple strategies for setting the aggregation weights during this stage. The simplest version is to set the influence of each domain uniformly, i.e. if we have three training domains, the connections from each domain would have the weight \(\frac{1}{3}\) such that each domain has the same influence on the final prediction. Our second approach is to jointly train a domain predictor to output the weights for the aggregation layer, which can be used either during both training and testing, or only during testing, similar to what is done by Mancini et al. [124]. This method allows for a more flexible aggregation of the separate predictions coming from the different prototype layers, enabling the network to put more emphasis on the relevant domain prototypes.

在此阶段设置聚合权重存在多种策略。最简单的版本是均匀设置每个领域的影响力,即如果有三个训练领域,则每个领域的连接权重为13,使每个领域对最终预测的影响相同。第二种方法是联合训练一个领域预测器,输出聚合层的权重,该权重可在训练和测试期间均使用,或仅在测试期间使用,类似于Mancini等人[124]的做法。该方法允许更灵活地聚合来自不同原型层的分离预测,使网络能更强调相关领域的原型。
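The aggregation strategies above differ only in the weight vector applied to the stacked per-domain predictions; a sketch with our own naming:

```python
import numpy as np

def aggregate(domain_logits, weights=None):
    """Aggregate per-domain classifier outputs.

    domain_logits : (S, C) prediction of each of the S domain branches
    weights       : (S,) aggregation weights; a one-hot domain encoding
                    during training, uniform 1/S if None (test time)."""
    S = domain_logits.shape[0]
    if weights is None:
        weights = np.full(S, 1.0 / S)
    return weights @ domain_logits
```

During training, the one-hot weights route the gradient only through the sample's own domain branch; at test time, uniform (or predicted) weights let every source domain's prototypes vote on the unseen domain.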

Lastly, we also experiment with an ensemble variant that is specific to prototype layers. Instead of only pre-defining a number of prototypes for each class, as in the domain-agnostic prototype layer outlined in Section 4.2, we can also pre-define a domain correspondence for each prototype. Training is then done by passing each domain separately to the prototype layer and masking the overall prototype outputs if they do not correspond to the current environment. The cluster and separation losses are adapted to an average over the individual environments as:

最后,我们还尝试了一个特定于原型层的集成变体。与仅在第4.2节中定义的领域无关原型层为每个类别预定义原型数量不同,我们还可以为每个原型预定义领域对应关系。训练时,将每个领域单独传入原型层,并在不对应当前环境时屏蔽整体原型输出。聚类和分离损失调整为对各个环境的平均,如下所示:

\[ \mathcal{L}_{clst} = \frac{1}{s} \sum_{\xi \in \Xi} \frac{1}{n_\xi} \sum_{i=1}^{n_\xi} \min_{j : p_j \in P_{y_i}^{\xi}} \min_{\tilde{\mathbf{z}} \in \operatorname{patches}(\mathbf{z}^{\xi})} \lVert \tilde{\mathbf{z}} - p_j \rVert_2^2 \tag{4.16} \]

\[ \mathcal{L}_{sep} = -\frac{1}{s} \sum_{\xi \in \Xi} \frac{1}{n_\xi} \sum_{i=1}^{n_\xi} \min_{j : p_j \notin P_{y_i}^{\xi}} \min_{\tilde{\mathbf{z}} \in \operatorname{patches}(\mathbf{z}^{\xi})} \lVert \tilde{\mathbf{z}} - p_j \rVert_2^2, \tag{4.17} \]


1 Further implementation details can be found here: https://github.com/SirRob1997/DomainBed/

1 更多实现细节见:https://github.com/SirRob1997/DomainBed/


where \(P_{y_i}^{\xi}\) denotes the prototypes associated with the specific class and environment, while \(\mathbf{z}^{\xi}\) denotes the latent representation of an image from the current domain. This ensemble variant inherently removes the need for setting appropriate aggregation weights, as prototype activations are simply masked during training, while during testing all prototypes are kept, allowing each prototype from each source domain to influence the prediction. With a domain-specific cross-entropy loss:

其中Pyiξ表示与特定类别和环境相关联的原型,zξ表示对应当前领域的图像潜在表示。该集成变体本质上消除了设置适当聚合权重的需求,因为训练时原型激活被屏蔽,而测试时保留所有原型,使每个源领域的原型都能影响预测。采用领域特定的交叉熵损失:

\[ \mathcal{L}_{ce} = -\frac{1}{s} \sum_{\xi \in \Xi} \frac{1}{n_\xi} \sum_{i=1}^{n_\xi} \sum_{c=1}^{C} y_{i,c}^{\xi} \log\big(\hat{y}_{i,c}^{\xi}\big), \tag{4.18} \]

we optimize the final loss of our ensemble model as:

我们优化集成模型的最终损失为:

\[ \mathcal{L} = \mathcal{L}_{ce} + \lambda_{clst} \mathcal{L}_{clst} + \lambda_{sep} \mathcal{L}_{sep}. \tag{4.19} \]

While all of these ensemble variants should work given the intuition from previous works that have used model ensembles for learned domain-specific latent spaces, our additional experiments on the proposed variants show that these assumptions do not hold for prototype networks. In our experiments, any prototype ensemble was consistently outperformed by a single domain-agnostic prototype layer.

尽管基于先前使用模型集成学习领域特定潜在空间的工作直觉,所有这些集成变体理论上都应有效,但根据我们对所提变体的额外实验观察,这些假设并不适用于原型网络。对我们而言,任何原型集成都始终被单一领域无关原型层所超越。

4.2.2 Diversified Prototypes (ProDrop)

4.2.2 多样化原型(ProDrop)

Initial experiments with both domain-agnostic and domain-specific prototype layers led to unsatisfactory results. To investigate this behavior, we analyze the pairwise prototype \(\ell_2\)-distance as well as the cosine distance \(\varrho\), which for any two prototypes \(p_i\) and \(p_j\) are given by:

对领域无关和领域特定原型层的初步实验结果均不理想。为探究此现象,我们分析了成对原型的2距离以及余弦距离ϱ,对于任意两个原型pipj,其定义如下:

(4.20) \( \ell_2 = \left\|\mathbf{p}_i - \mathbf{p}_j\right\|_2 \)

(4.21) \( \varrho = 1 - \frac{\mathbf{p}_i^{\top}\mathbf{p}_j}{\left\|\mathbf{p}_i\right\|_2\left\|\mathbf{p}_j\right\|_2}. \)

The \( \ell_2 \)-distance captures the Euclidean distance between any two prototypes, while the cosine-distance \( \varrho \in [0,2] \) is the inverted version of the cosine similarity, a metric judging the cosine of the angle between them. Here, we visualize the cosine-distance instead of the cosine similarity to match the color scheme of the \( \ell_2 \)-distance, i.e., low values indicate closeness.

通过2距离,我们可以掌握任意两个原型之间的欧几里得距离,而余弦距离ϱ[0,2]是余弦相似度的反向版本,余弦相似度是衡量它们之间夹角余弦的度量。这里,我们可视化余弦距离而非余弦相似度,以匹配2距离的配色方案,即低值表示接近。
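Both metrics can be computed for a whole prototype layer at once. The following numpy sketch (shapes and names are illustrative) returns the full pairwise \( \ell_2 \)- and cosine-distance matrices of Equations (4.20) and (4.21):

```python
import numpy as np

def pairwise_prototype_distances(P):
    """Pairwise l2- and cosine-distances between prototype vectors.

    P: (n_prototypes, d) array of learned prototypes.
    Returns (l2, cos) matrices; cos lies in [0, 2] and low values mean closeness.
    """
    diff = P[:, None, :] - P[None, :, :]
    l2 = np.linalg.norm(diff, axis=-1)               # Eq. (4.20)
    norms = np.linalg.norm(P, axis=-1)
    cos = 1.0 - (P @ P.T) / np.outer(norms, norms)   # Eq. (4.21)
    return l2, cos

P = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]])
l2, cos = pairwise_prototype_distances(P)
# prototypes 0 and 2 are parallel: cos[0, 2] == 0 although l2[0, 2] == 1
```

Since both matrices are symmetric, only their upper triangles need to be visualized, as in Figures 4.5 and 4.6.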

The results of this analysis can be seen in Figure 4.5 and Figure 4.6 for the first data split and a negative weight of \( w_{c,j} = -{1.0}\;\forall j : \mathbf{p}_j \notin P_c \), but we also show the same plots for \( w_{c,j} = {0.0} \) and the two additional data splits in Appendix B. As both of the used metrics are symmetric, only the upper triangle is visualized. We observe that, depending on the data split and negative weight, the model sometimes converges to having one or two prototypes per class with a large \( \ell_2 \)-distance while many of the other prototypes are close. Striking examples of this behavior can be seen in Figure B.1 for the sketch domain or in Figure 4.5 for the sketch and cartoon domains. This observation suggests that in these scenarios the model fails to properly use all of the available prototypes and relies on only a significantly reduced subset per class, leaving the other prototypes untrained. Regarding the cosine-distance, however, we can often observe that exactly these prototypes with a high pairwise \( \ell_2 \)-distance to all the other prototypes have a slightly lower cosine-distance. Such behavior can be seen, for example, in Figure B.1 for the art and cartoon environments or in Figure 4.5 for the sketch and cartoon domains. For the most part, many cosine-distances tend to be low and more or less uniformly spaced. Nevertheless, we can occasionally identify "streaking" patterns in the cosine-distances where prototypes for certain classes are well-spaced but have a larger (or equal) cosine-distance to prototypes of other classes. See, for example, Figure B.1 for the sketch domain, Figure B.6 for the sketch and photo domains, and Figure B.7 for the sketch and cartoon domains.

该分析结果可见于图4.5和图4.6,针对第一数据划分和负权重wc,j=1.0j:pjPc,但我们也在附录B中展示了针对wc,j=0.0及另外两个数据划分的相同图表。由于所用的两个度量均为对称的,仅可视化了上三角部分。我们观察到,依据数据划分和负权重,模型有时会收敛为每个类别拥有一到两个原型,其2距离较大,而其他许多原型则较为接近。此行为的典型例子见于附录B.1中素描领域,或图4.5中素描和卡通领域。该观察表明,在这些场景中,模型未能充分利用所有可用原型,而仅依赖每类显著减少的子集,未训练其他原型。关于余弦距离,我们常观察到正是这些与其他原型成对2距离较大的原型,其余弦距离略低。例如,附录B.1中艺术和卡通环境,或图4.5中素描和卡通领域均可见此行为。大多数情况下,许多余弦距离趋于较低且较均匀分布。然而,我们偶尔能识别出余弦距离中的“条纹”模式,即某些类别的原型间间距良好,但它们与其他类别原型的余弦距离较大(或相等)。例如,见附录B.1素描领域,附录B.6素描与照片领域,以及附录B.7素描与卡通领域。

Figure 4.5: Pairwise learned prototype \( \ell_2 \)-distance (top) and cosine-distance \( \varrho \) (bottom) of the best-performing model with negative weight \( w_{c,j} = -{1.0}\;\forall j : \mathbf{p}_j \notin P_c \) for each testing domain. Red squares denote prototype class correspondence for the 7 different classes in the PACS dataset. No self-challenging is applied and colormap bounds are adjusted per metric for visualization purposes. Second data split.

图4.5:最佳表现模型在每个测试域中,负权重wc,j=1.0j:pjPc下的成对学习原型2距离(上)和余弦距离ϱ(下)。红色方块表示PACS数据集中7个不同类别的原型类别对应关系。未应用自我挑战,且为可视化目的对每个度量的色图边界进行了调整。第二数据划分。

From a design standpoint, we would like the prototypes within each class to be reasonably well spaced out in the latent space such that they can represent different discriminative attributes of each class. That is, we would like the network to utilize all the prototypes and not rely on only a small subset of prototypes or discriminative features for its prediction. The distances of these prototypes to the prototypes of the other classes, however, should not be constrained in any way and should be learned automatically. For example, when predicting different bird species, this allows the network to place similar head prototypes of different classes closer together. The existing cluster and separation losses enforce that each prototype associated with a class is close to at least one latent patch of that class while maximizing the distance to the prototypes of other classes. However, this does not enforce that each prototype associated with that class acts on a different discriminative feature.

从设计角度来看,我们希望每个类别内的原型在潜在空间中合理分布,以便它们能代表该类别的不同判别属性。也就是说,我们希望网络利用所有原型,而非仅依赖少数原型或判别特征进行预测。然而,这些原型与其他类别原型之间的距离不应受到任何限制,应自动学习。例如,在预测不同鸟类时,这允许网络将不同类别中相似的头部原型放得更近。现有的聚类和分离损失确保与该类别相关的每个原型至少靠近该类别的一个潜在补丁,同时最大化与其他类别原型的距离。但这并不强制每个与该类别相关的原型作用于不同的判别特征。

One approach to possibly enforce this behavior is to incorporate the self-challenging method previously applied in DivCAM into the presented prototype network, resulting in a novel algorithm we call prototype dropping (ProDrop), which is described in Algorithm 3. In essence, we extract features by passing our input images through the featurizer \( \mathbf{z} = \phi(\mathbf{x}) \) and compute the similarity scores for each prototype by passing them through the prototype layer \( g_{\mathbf{p}_j}(\mathbf{z}) \) with Equation (3.4). Based on these similarity scores, we compute a mask \( m_{c,j} \) for the prototypes of the respective class \( c \) with the Top-\( p \) highest activations:

一种可能强制实现此行为的方法是将先前应用于DIVCAM的自我挑战方法整合到所提出的原型网络中,形成一种我们称之为原型丢弃(PRODROP)的新算法,详见算法3。本质上,我们通过将输入图像传入特征提取器z=ϕ(x)提取特征,并通过原型层gpj(z)利用公式(3.4)计算每个原型的相似度分数。基于这些相似度分数,我们为相应类别c的原型计算掩码mc,j,选取激活值最高的Top-p

(4.22) \( m_{c,j} = \begin{cases} 0, & \text{if } g_{\mathbf{p}_j}(\mathbf{z}) \geq q_{c,p} \;\forall j : \mathbf{p}_j \in P_c \\ 1, & \text{otherwise,} \end{cases} \)

Algorithm 3: Prototype Dropping (ProDrop)

算法3:原型丢弃(ProDrop)


Input: Data \( \mathbf{X},\mathbf{Y} \) with \( \mathbf{x}_i \in \mathbb{R}^{H \times W \times 3} \), drop factors \( p, b \), epochs \( T \)

输入:数据 X,Y ,包含 xiRH×W×3 ,丢弃因子 p,b ,训练轮数 T

while epoch \( \leq T \) do

当 epoch T

for every batch \( \mathbf{x},\mathbf{y} \) do
对每个批次 \( \mathbf{x},\mathbf{y} \) 执行
	Extract features \( \mathbf{z} = \phi \left( \mathbf{x}\right) \) // \( \mathbf{z} \) has shape \( {\mathbb{R}}^{{H}_{\mathbf{z}} \times  {W}_{\mathbf{z}} \times  K} \)
	提取特征 \( \mathbf{z} = \phi \left( \mathbf{x}\right) \) // \( \mathbf{z} \) 的形状为 \( {\mathbb{R}}^{{H}_{\mathbf{z}} \times  {W}_{\mathbf{z}} \times  K} \)
	Compute \( {g}_{{\mathbf{p}}_{j}}\left( \mathbf{z}\right) \) with Equation (3.4)
	使用公式(3.4)计算 \( {g}_{{\mathbf{p}}_{j}}\left( \mathbf{z}\right) \)
	Compute \( {\mathbf{m}}_{c,j} \) with Equation (4.22)
	使用公式(4.22)计算 \( {\mathbf{m}}_{c,j} \)
	Adapt \( {\mathbf{m}}_{c,j} \) with Equation (4.4)
	使用公式(4.4)调整 \( {\mathbf{m}}_{c,j} \)
	Compute \( {\widetilde{g}}_{\mathbf{p}}\left( \mathbf{z}\right) \) with Equation (4.23)
	使用公式(4.23)计算 \( {\widetilde{g}}_{\mathbf{p}}\left( \mathbf{z}\right) \)
	Backpropagate loss \( {\mathcal{L}}_{ce}\left( {w\left( {{\widetilde{g}}_{\mathbf{p}}\left( \mathbf{z}\right) }\right) ,\mathbf{y}}\right)  + {\lambda }_{4}{\mathcal{L}}_{\text{clst }} + {\lambda }_{5}{\mathcal{L}}_{\text{sep }} \)
	反向传播损失 \( {\mathcal{L}}_{ce}\left( {w\left( {{\widetilde{g}}_{\mathbf{p}}\left( \mathbf{z}\right) }\right) ,\mathbf{y}}\right)  + {\lambda }_{4}{\mathcal{L}}_{\text{clst }} + {\lambda }_{5}{\mathcal{L}}_{\text{sep }} \)
end
结束

end

结束


where \( q_{c,p} \) is the corresponding threshold value. We also apply the mask batching from DivCAM without scheduling, which only applies this type of masking to the samples with the highest ground-truth confidence. Finally, we can mask the samples using the Hadamard product \( \odot \):

其中 qc,p 是对应的阈值。我们还采用了来自 DivCAM 的批量掩码方法,无需调度,仅对真实标签中置信度最高的样本应用此类掩码。最后,我们可以使用哈达玛积 对样本进行掩码处理,公式如下:

(4.23) \( \tilde{g}_{\mathbf{p}}\left(\mathbf{z}\right) = \mathbf{m} \odot g_{\mathbf{p}}\left(\mathbf{z}\right). \)

In practice, all of these operations can be efficiently implemented using torch.quantile, torch.lt, and torch.logical_or on small tensors, all of which pose no significant computational overhead.

在实际操作中,所有这些操作都可以通过 torch.quantile、torch.lt 和 torch.logical_or 在小张量上高效实现,且不会带来显著的计算开销。
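A minimal sketch of Equations (4.22) and (4.23) for a single sample could look as follows. It mirrors the torch.quantile, torch.lt, and torch.logical_or calls mentioned above in numpy so that it is self-contained; the shapes and the drop fraction are illustrative choices of ours:

```python
import numpy as np

def prodrop_mask(scores, proto_class, label, p):
    """Drop the top-p fraction of same-class prototype activations (Eqs. 4.22/4.23).

    scores:      (n_prototypes,) similarity scores g_p(z) for one sample
    proto_class: (n_prototypes,) class index of each prototype
    label:       ground-truth class of the sample
    p:           fraction of same-class prototypes to drop
    """
    same_class = proto_class == label
    q = np.quantile(scores[same_class], 1.0 - p)  # threshold q_{c,p}
    below = scores < q                            # torch.lt analogue
    mask = np.logical_or(below, ~same_class)      # other classes stay untouched
    return scores * mask.astype(scores.dtype)     # Hadamard product, Eq. (4.23)

scores = np.array([0.9, 0.5, 0.1, 0.8])
proto_class = np.array([0, 0, 0, 1])
masked = prodrop_mask(scores, proto_class, label=0, p=0.34)
# the highest same-class activation (0.9) is dropped; the class-1 prototype is kept
```

Forcing the network to predict without its most active same-class prototypes should encourage the remaining prototypes to pick up complementary discriminative features.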

The effect of this approach on the pairwise prototype distances can be seen in Figure B.3 as well as Appendix B. We observe that, even though self-challenging helps to boost the overall performance (see Section 5.4.4), it does not fully achieve the previously described desired distance properties and only improves them marginally. Positive effects can be seen in Figure 4.6 for the sketch and cartoon domains, Figure B.10 for the sketch domain, or Figure B.9 for the cartoon domain.

该方法对成对原型距离的影响可见于图B.3及附录B。我们观察到,尽管自我挑战(self-challenging)有助于提升整体性能(参见第5.4.4节),但其并未很好地实现前述期望的距离特性,仅有边际改进。正面效果可见于图4.6(草图和卡通领域)、图B.10(草图领域)以及图B.9(卡通领域)。

Our second approach to enforce the desired distance structures is to add an additional intra-class prototype loss term \( \mathcal{L}_{\text{intra}} \), weighted by \( \lambda_6 \), which maximizes the intra-class prototype \( \ell_2 \)- and/or cosine-distance. This loss term can, in theory, have a few different definitions depending on the chosen distance metrics; we experiment with:

我们加强期望距离结构的第二种方法是添加额外的类内原型损失项 Lintra ,该项最大化类内原型 2 和/或由 λ6 加权的余弦距离。同样,该损失项理论上可根据所选距离度量有多种定义,我们尝试了以下几种:

(4.24) \( \mathcal{L}_{\text{intra}} = -\sum_{\mathbf{p}_i,\mathbf{p}_j \in P_c} \lambda_{\ell_2} \underbrace{\left\|\mathbf{p}_i - \mathbf{p}_j\right\|_2^2}_{\ell_2\text{-distance}} + \lambda_{\varrho} \underbrace{\left(1 - \frac{\mathbf{p}_i^{\top}\mathbf{p}_j}{\left\|\mathbf{p}_i\right\|_2\left\|\mathbf{p}_j\right\|_2}\right)}_{\text{cosine-distance}}, \)

where the \( \ell_2 \)-distance and the cosine-distance are weighted by \( \lambda_{\ell_2} \) and \( \lambda_{\varrho} \), respectively. Performance results for the loss presented in Equation (4.24) with \( \lambda_{\ell_2} = 1 \) and \( \lambda_{\varrho} = 1 \) are shown in Section 5.4.5. Since the cosine-distance is bounded, this commonly amounts to the \( \ell_2 \)-distance having a higher influence. We also experimented with other values for \( \lambda_{\ell_2} \) and \( \lambda_{\varrho} \), such as setting either one of them to zero and only applying either the \( \ell_2 \)- or the cosine-distance (\( \lambda_{\ell_2} = 0, \lambda_{\varrho} = 1 \) and \( \lambda_{\ell_2} = 1, \lambda_{\varrho} = 0 \)), but could not find any further benefits by canceling or re-weighting them differently.

其中2距离和余弦距离分别由λ2λϱ加权。方程(4.24)中损失函数的性能结果,结合λ2=1λϱ=1,见第5.4.5节。由于余弦距离有界,这通常导致2距离的影响更大。我们还尝试了其他λ2λϱ的取值,比如将其中一个设为零,仅应用2距离或余弦距离的(λ2=0,λϱ=1λ2=1,λϱ=0),但未发现通过取消或重新加权能带来额外收益。
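A possible numpy sketch of Equation (4.24) for the prototypes of a single class is shown below. The leading negative sign reflects the stated goal that minimizing the loss should maximize the pairwise distances; the function and variable names are illustrative, not the thesis implementation:

```python
import numpy as np

def intra_class_loss(P_c, lam_l2=1.0, lam_cos=1.0):
    """Intra-class prototype diversity loss (Eq. 4.24), negated so that
    minimizing it spreads the prototypes of one class apart.

    P_c: (k, d) prototypes belonging to one class.
    """
    loss = 0.0
    k = P_c.shape[0]
    for i in range(k):
        for j in range(i + 1, k):
            l2_sq = np.sum((P_c[i] - P_c[j]) ** 2)
            cos = 1.0 - P_c[i] @ P_c[j] / (
                np.linalg.norm(P_c[i]) * np.linalg.norm(P_c[j]))
            loss -= lam_l2 * l2_sq + lam_cos * cos
    return loss

P_spread = np.array([[1.0, 0.0], [0.0, 1.0]])   # well-separated prototypes
P_close = np.array([[1.0, 0.0], [1.0, 0.01]])   # nearly collapsed prototypes
# well-spread prototypes yield a lower (more negative) loss than collapsed ones
```

Setting `lam_l2` or `lam_cos` to zero recovers the \( (\lambda_{\ell_2} = 0, \lambda_{\varrho} = 1) \) and \( (\lambda_{\ell_2} = 1, \lambda_{\varrho} = 0) \) variants discussed above.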

Influence of the negative weight on the distance metrics We also analyze the discrepancies between the distance metrics when comparing \( w_{c,j} = -{1.0}\;\forall j : \mathbf{p}_j \notin P_c \) and \( w_{c,j} = {0.0}\;\forall j : \mathbf{p}_j \notin P_c \), which can be seen in the figures presented in Appendix B. However, these plots reveal no consistent trends as to how the negative weight influences the training behavior of the prototypes.

负权重对距离度量的影响 我们还分析了比较wc,j=1.0j:pjPcwc,j=0.0j:pjPc时距离度量之间的差异,这些差异可见于附录B中的图表。然而,从这些图中未观察到负权重如何影响原型训练行为的稳定趋势。

Figure 4.6: Pairwise learned prototype \( \ell_2 \)-distance (top) and cosine-distance \( \varrho \) (bottom) of the best-performing model with negative weight \( w_{c,j} = -{1.0}\;\forall j : \mathbf{p}_j \notin P_c \) for each testing domain. Red squares denote prototype class correspondence for the 7 different classes in the PACS dataset. Self-challenging is applied and colormap bounds are adjusted per metric for visualization purposes. Second data split.

图4.6:最佳模型在每个测试域中负权重为wc,j=1.0j:pjPc时的成对学习原型2距离(上图)和余弦距离ϱ(下图)。红色方块表示PACS数据集中7个不同类别的原型类别对应关系。应用了自我挑战机制,且为可视化目的对每种度量的色彩图界限进行了调整。第二数据划分。

4.2.3 Using Support Sets (D-Transformers)

4.2.3 使用支持集(D-Transformers)

Instead of directly learning the set of prototypes \( P \), we can rely on a support set similar to what is done by Doersch, Gupta, and Zisserman [43] or Snell, Swersky, and Zemel [173]. Here, the prototypes are based on a support set \( S \) consisting of \( n_c \) sample images \( \mathbf{x}_i \). This support set exists for each class \( c \) as \( S_c = \{\mathbf{x}_{i,c}\}_{i=1}^{n_c} \), where \( \mathbf{x}_{i,c} \) is an image from class \( c \). In the classical setting, the set of prototypes for each class \( P_c \) is then obtained by computing one prototype \( \mathbf{p}_{j,c} \) per class, averaging the average-pooled latent representations of the support set as:

我们不直接学习原型集合P,而是依赖类似于Doersch、Gupta和Zisserman [43]或Snell、Swersky和Zemel [173]所采用的支持集。这里,原型基于支持集S,该支持集包含nc张样本图像xi。该支持集针对每个类别c存在,表示为Sc={xi,c}i=1nc,其中xi,c是类别c中的一张图像。在经典设置中,每个类别Pc的原型集合通过计算每个类别pj,c的一个原型获得,即对支持集的平均池化潜在表示求平均,公式如下:

(4.25) \( \mathbf{p}_{j,c} = \frac{1}{\left|S_c\right|}\sum_{\mathbf{x}_{i,c} \in S_c}\phi\left(\mathbf{x}_{i,c}\right) \)

Contrary to the previous approach, by averaging all the average-pooled latent representations there exists only one prototype for each class, i.e., \( \left|P_c\right| = 1 \) and \( j = 1 \), which loses spatial information. Predictions are again made by computing some distance function between the prototypes and the latent representation of the image to be classified. Doersch, Gupta, and Zisserman [43] preserve the spatial information of the feature extractor and use attention weights to guide the averaging across support-set images and latent patches in a method they call CROSSTRANSFORMERS. We extend their approach to the domain generalization case here, where we compute attention across multiple training environments. In the following sections, this adaptation is referred to as D-TRANSFORMERS.

与之前的方法相反,通过对所有平均池化潜在表示求平均,每个类别仅存在一个原型,即|Pc|=1j=1,这会丢失空间信息。预测仍通过计算原型与待分类图像潜在表示之间的某种距离函数进行。Doersch、Gupta和Zisserman [43]保留了特征提取器的空间信息,并利用注意力权重引导对支持集图像和潜在补丁的平均,这种方法称为CROSSTRANSFORMERS。我们在此将其方法扩展到域泛化场景,计算多个训练环境间的注意力。以下章节中,该改编称为D-TRANSFORMERS。
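The classical prototype computation of Equation (4.25) and the resulting nearest-prototype prediction can be sketched in a few lines of numpy; the toy featurizer and class names below are purely illustrative stand-ins for the average-pooled backbone output:

```python
import numpy as np

def class_prototype(phi, support_images):
    """One prototype per class: the mean of the (average-pooled) latent
    representations of the class support set, Eq. (4.25)."""
    feats = np.stack([phi(x) for x in support_images])
    return feats.mean(axis=0)

def classify(phi, x, prototypes):
    """Predict the class whose prototype is closest (squared l2) to phi(x)."""
    z = phi(x)
    dists = {c: np.sum((z - p) ** 2) for c, p in prototypes.items()}
    return min(dists, key=dists.get)

# Toy featurizer standing in for the average-pooled feature extractor.
phi = lambda x: np.asarray(x, dtype=float)
prototypes = {
    "cat": class_prototype(phi, [[0.0, 1.0], [0.2, 0.8]]),
    "dog": class_prototype(phi, [[1.0, 0.0], [0.8, 0.2]]),
}
pred = classify(phi, [0.1, 0.9], prototypes)  # -> "cat"
```

Because the averaging collapses all spatial locations into one vector per class, this baseline is exactly the setting whose lost spatial information D-TRANSFORMERS recover via attention.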

In particular, similar to Transformers [189] and the original idea by Doersch, Gupta, and Zisserman [43], we use three linear transformations to compute keys, values, and queries. In practice, the key head \( \Gamma : \mathbb{R}^K \rightarrow \mathbb{R}^{d_k} \), the value head \( \Lambda : \mathbb{R}^K \rightarrow \mathbb{R}^{d_v} \), and the query head \( \Omega : \mathbb{R}^K \rightarrow \mathbb{R}^{d_k} \) can each be implemented through convolutions with kernel_size = 1. Prototypes for each domain are computed by passing the support set \( S_c^{\xi} = \{\mathbf{x}_{i,c}^{\xi}\}_{i=1}^{n_{\xi,c}} \) associated with domain \( \xi \) and class \( c \) through the feature extractor. For each spatial location \( m \) in the resulting features (indexed over \( H_{\mathbf{z}} \times W_{\mathbf{z}} \)) of support-set image \( i \), we compute dot-product attention scores between the keys \( \mathbf{k}_{i,m,c}^{\xi} = \Gamma(\phi(\mathbf{x}_{i,c}^{\xi}))_m \) and the query vectors \( \mathbf{q}_p^{\xi} = \Omega(\phi(\mathbf{x}_q^{\xi}))_p \) for a query image \( \mathbf{x}_q \) at spatial location \( p \) (indexed over \( H_{\mathbf{z}} \times W_{\mathbf{z}} \)). Explicitly, the dot similarity \( \alpha_{i,m,c,p}^{\xi} \) between them is computed as:

特别地,类似于Transformers [189]和Doersch、Gupta及Zisserman [43]的原始思路,我们使用三个线性变换来计算键(keys)、值(values)和查询(queries)。在实际操作中,键头Γ:RKRdk、值头Λ:RKRdv和查询头Ω:RKRdk均可通过卷积核大小为1的卷积实现。每个域的原型通过将与域ξ和类别c相关联的支持集Scξ={xi,cξ}i=1nξ,c输入特征提取器计算得到。对于支持集图像i的结果特征中每个空间位置m(索引范围为Hz×Wz),我们计算键ki,m,cξ=Γ(ϕ(xi,cξ))m与查询图像xq在空间位置p(索引范围为Hz×Wz)的查询向量qpξ=Ω(ϕ(xqξ))p之间的点积注意力分数。具体地,它们之间的点积相似度αi,m,c,pξ计算如下:

(4.26) \( \alpha_{i,m,c,p}^{\xi} = {\mathbf{k}_{i,m,c}^{\xi}}^{\top}\mathbf{q}_p^{\xi}. \)

Afterwards, we re-scale this dot similarity using \( \tau = \sqrt{d_k} \) and obtain the final attention weights \( \tilde{\alpha}_{i,m,c,p}^{\xi} \) using a softmax operation summing over all spatial locations and images in the support set:

随后,我们使用τ=dk对该点积相似度进行重新缩放,并通过对支持集中所有空间位置和图像求和的softmax操作,得到最终的注意力权重α~i,m,c,pξ

(4.27) \( \tilde{\alpha}_{i,m,c,p}^{\xi} = \frac{\exp\left(\alpha_{i,m,c,p}^{\xi}/\tau\right)}{\sum_{i^{\prime},m^{\prime}}\exp\left(\alpha_{i^{\prime},m^{\prime},c,p}^{\xi}/\tau\right)}. \)

Finally, we can use the support-set values \( \mathbf{v}_{i,m,c}^{\xi} = \Lambda(\phi(\mathbf{x}_{i,c}^{\xi}))_m \) together with the attention weights to compute a prototype vector per spatial location \( p \) for each domain and class:

最后,我们可以利用支持集的值vi,m,cξ=Λ(ϕ(xi,cξ))m和注意力权重,为每个域和类别的每个空间位置p计算一个原型向量:

(4.28) \( \mathbf{p}_{p,c}^{\xi} = \sum_{i,m}\tilde{\alpha}_{i,m,c,p}^{\xi}\,\mathbf{v}_{i,m,c}^{\xi}. \)

As a distance, we compute the dot similarity between the prototype and the query-image values \( \mathbf{w}_p^{\xi} = \Lambda(\phi(\mathbf{x}_q^{\xi}))_p \), where we sum over the training environments \( \xi \in \Xi \) and the spatial locations \( p \) of the query image:

作为距离度量,我们计算原型与查询图像值 \( \mathbf{w}_p^{\xi} = \Lambda(\phi(\mathbf{x}_q^{\xi}))_p \) 之间的点积相似度,并对训练环境 \( \xi \in \Xi \) 和查询图像的空间位置 \( p \) 进行求和:

(4.29) \( \operatorname{dist}\left(\mathbf{x}_q, S_c\right) = \sum_{\xi \in \Xi}\frac{1}{H_{\mathbf{z}}W_{\mathbf{z}}}\sum_{p}{\mathbf{p}_{p,c}^{\xi}}^{\top}\mathbf{w}_p^{\xi}. \)

Most notably, we deploy the dot similarity as a distance metric here instead of the squared \( \ell_2 \)-norm used by Doersch, Gupta, and Zisserman [43], since we observe numerical instability for the latter in our experiments, which might be due to the distribution shift in domain generalization. In their work, they reason that using the same value head \( \Lambda \) for the queries as for the support-set images ensures that the architecture works as a distance, i.e., if the support set contains the same images as the query, they want the Euclidean distance to be 0 [43]. Because we observe no performance gains from keeping them separate, we also use the same value head \( \Lambda \) for both the queries and the support-set images to reduce additional parameters to a minimum.

最值得注意的是,我们这里采用点积相似度作为距离度量,而非Doersch、Gupta和Zisserman [43]使用的平方2-范数,因为我们在实验中观察到后者存在数值不稳定性,这可能是由于域泛化中的分布偏移所致。在他们的工作中,使用与支持集图像相同的值头Λ来处理查询,是为了确保架构作为距离度量的有效性,即当支持集包含与查询相同的图像时,他们希望欧氏距离为0 [43]。鉴于我们未观察到将两者分开带来的性能提升,我们也对查询和支持集图像使用相同的值头Λ,以将额外参数降至最低。
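Putting Equations (4.26) to (4.29) together for a single class and a single domain, a numpy sketch could look as follows. The shapes are illustrative, and the 1x1 convolutions of the key, value, and query heads reduce to per-location matrix multiplications:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def d_transformer_distance(support_feats, query_feats, Gamma_w, Lambda_w, Omega_w):
    """Attention-based distance for one class and one domain (Eqs. 4.26-4.29);
    the full method additionally sums over the training environments.

    support_feats: (n, M, K) latent patches of n support images (M spatial locations)
    query_feats:   (M, K)    latent patches of the query image
    Gamma_w, Lambda_w, Omega_w: (K, d) weight matrices of the key/value/query heads.
    """
    d_k = Gamma_w.shape[1]
    keys = support_feats @ Gamma_w      # k_{i,m,c}
    values = support_feats @ Lambda_w   # v_{i,m,c}
    queries = query_feats @ Omega_w     # q_p
    w = query_feats @ Lambda_w          # query values share the value head
    M = query_feats.shape[0]
    dist = 0.0
    for p in range(M):
        alpha = (keys @ queries[p]).ravel() / np.sqrt(d_k)    # Eq. (4.26), scaled
        att = softmax(alpha)                                  # Eq. (4.27)
        proto_p = att @ values.reshape(-1, values.shape[-1])  # Eq. (4.28)
        dist += proto_p @ w[p]                                # Eq. (4.29), dot similarity
    return dist / M

rng = np.random.default_rng(0)
support = rng.normal(size=(2, 4, 3))  # n=2 images, M=4 locations, K=3 channels
query = rng.normal(size=(4, 3))
Gamma_w, Lambda_w, Omega_w = (rng.normal(size=(3, 2)) for _ in range(3))
dist_val = d_transformer_distance(support, query, Gamma_w, Lambda_w, Omega_w)
```

The class prediction is then obtained by comparing `dist_val` across classes after summing the per-domain distances over \( \xi \in \Xi \).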

Chapter 5

第五章

Experiments

实验

In an effort to improve comparability and reproducibility, we use DOMAINBED [73] for all ablation studies and experimental results. We compare to all methods currently in DOMAINBED, which includes our provided implementation of RSC. Further, all experiments show results for both training-domain validation, which assumes that training and testing domains have similar distributions, and oracle validation, which has limited access to the testing domain. We omit leave-one-domain-out cross-validation as it requires the most computational resources and performs the worst out of the three validation techniques currently available in DOMAINBED [73].

为了提升可比性和可复现性,我们在所有消融研究和实验结果中均使用DOMAINBED [73]。我们比较了DOMAINBED中所有现有方法,包括我们提供的RSC实现。此外,所有实验均展示了训练域验证结果(假设训练和测试域分布相似)和oracle验证结果(有限访问测试域)。我们省略了留一域交叉验证,因为其计算资源需求最高且在DomainBED [73]中三种验证技术中表现最差。

5.1 Datasets and splits

5.1 数据集与划分

Since the size of the validation dataset can have a heavy impact on performance, we follow the design choices of DOMAINBED and choose 20% of each domain as the validation size for all experiments and ablation studies. Here, we present results for VLCS [52], PACS [107], Office-Home [190], Terra Incognita [17], and DomainNet [141]. Although sometimes disregarded in the domain generalization literature, we provide results for three different dataset splits to assess the stability of the model selection and to avoid overfitting to one split.

由于验证集的大小会对性能产生重大影响,我们遵循DOMAINBED的设计选择,选择每个域的20%作为所有实验和消融研究的验证集大小。这里,我们展示了VLCS [52]、PACS [107]、Office-Home [190]、Terra Incognita [17]和DomainNet [141]的数据结果。尽管在领域泛化的文献中有时被忽视,我们提供了三种不同数据集划分的结果,以评估模型选择的稳定性并避免对单一划分的过拟合。

5.2 Hyperparameter Distributions & Schedules

5.2 超参数分布与调度

For the main results, we use the official DOMAINBED hyperparameter distributions as well as the ADAM optimizer with no learning rate schedule. This corresponds to the setup in which all baselines from Table 5.2 were evaluated by Gulrajani and Lopez-Paz [73], providing a fair comparison. For the hyperparameters introduced by our methods, we choose distributions similar to those used in the ablation studies from Section 5.4. Further, Section 5.4 also shows the official DOMAINBED distributions for all other shared hyperparameters such as the learning rate \( \alpha \) or the batch size \( B \).

对于主要结果,我们使用官方DOMAINBED的超参数分布以及无学习率调度的ADAM优化器。这对应于Gulrajani和Lopez-Paz [73]在表5.2中评估所有基线时所采用的设置,以确保公平比较。对于我们方法中新引入的超参数,我们选择与第5.4节消融研究中使用的类似分布。此外,第5.4节还展示了DOMAINBED官方对所有其他共享超参数(如学习率α或批量大小B)的分布。

5.3 Results

5.3 结果

The high-level results for DivCAM-S inside the DOMAINBED framework and across datasets are shown in Table 5.2. For completeness, we also show results outside of DOMAINBED on the official PACS split in Table 5.3 using a ResNet-18 backbone. The full results, including the performance for choosing any domain inside each dataset as a testing domain, are shown in Appendix A. While we are able to achieve state-of-the-art performance outside of the DOMAINBED framework, utilizing learning rate schedules and hyperparameter fine-tuning, we also achieve Top-4 performance across five datasets within DOMAINBED. Notably, we outperform RSC in almost every scenario while exhibiting a lower standard deviation. This leads to more stable results and a method that can easily be used as a plug-and-play approach. In comparison with all other methods, we achieve good performance for the two most challenging datasets, namely Terra Incognita (Top-2) and DomainNet (Top-4), outperforming RSC on both datasets by up to 2%. This suggests that directly reconstructing Grad-CAMs in DivCAM-S provides value specifically for the more challenging datasets in DOMAINBED. Note that this is possible without adding any additional parameters and while providing a framework in which the intermediate class activation maps that guide the network's training process can be visualized. While this might not be the best explainability method, it certainly offers more insights than treating the optimization procedure as a black box without this guidance.

DIvCAM-S在DomainBED框架内及跨数据集的高层次结果如表5.2所示。为完整起见,我们还在表5.3中展示了使用ResNet-18骨干网络,在官方PACS划分外的结果。完整结果,包括选择每个数据集内任一域作为测试域的性能,见附录A。虽然我们能够在DOMAINBED框架外通过利用学习率调度和超参数微调实现最先进性能,但在DOMAINBED内,我们也能在五个数据集中实现Top-4性能。值得注意的是,我们几乎在所有场景中均优于RSC,且表现出更小的标准差。这带来了更稳定的结果和更适合即插即用的方法。与其他方法相比,我们在两个最具挑战性的数据集Terra Incognita(Top-2)和DomainNet(Top-4)上表现良好,分别超越RSC最多达2%。这表明,DIVCAM-S中直接重建至Grad-CAM(梯度加权类激活映射)对DOMAINBED中更具挑战性的数据集特别有价值。请注意,这在不增加任何额外参数的情况下实现,同时提供了一个可以可视化中间类激活图的框架,指导网络的训练过程。虽然这可能不是最佳的可解释性方法,但肯定比将优化过程视为黑箱且无此指导更具洞察力。

Algorithm | PACS (per-domain accuracy) | | | | Avg.
DivCAM-S | \( {94.4} \pm {0.7} \) | \( {80.5} \pm {0.4} \) | \( {74.6} \pm {2.2} \) | \( {79.0} \pm {0.9} \) | \( {82.1} \pm {0.3} \)
ProDrop | \( {93.6} \pm {0.6} \) | \( {82.1} \pm {0.9} \) | \( {76.4} \pm {0.9} \) | \( {76.3} \pm {0.6} \) | \( {82.1} \pm {0.6} \)
D-TRANSFORMERS | \( {93.7} \pm {0.6} \) | \( {80.1} \pm {1.2} \) | \( {73.3} \pm {1.3} \) | \( {72.6} \pm {1.5} \) | \( {79.9} \pm {0.1} \)
算法 | PACS(各域准确率) | | | | 平均值
DivCAM-S | \( {94.4} \pm {0.7} \) | \( {80.5} \pm {0.4} \) | \( {74.6} \pm {2.2} \) | \( {79.0} \pm {0.9} \) | \( {82.1} \pm {0.3} \)
ProDrop | \( {93.6} \pm {0.6} \) | \( {82.1} \pm {0.9} \) | \( {76.4} \pm {0.9} \) | \( {76.3} \pm {0.6} \) | \( {82.1} \pm {0.6} \)
D-TRANSFORMERS | \( {93.7} \pm {0.6} \) | \( {80.1} \pm {1.2} \) | \( {73.3} \pm {1.3} \) | \( {72.6} \pm {1.5} \) | \( {79.9} \pm {0.1} \)

Table 5.1: Performance comparison of the proposed methods for the PACS dataset with a ResNet-18 backbone.

表5.1:基于ResNet-18骨干网络的PACS数据集上所提方法的性能比较。

The fact that we are able to achieve state-of-the-art results for DivCAM-S outside of DOMAINBED highlights a common problem with works in domain generalization, namely the consistency and reproducibility of algorithm comparisons. Due to computational constraints, novel methods are often only compared to the results provided in previous works. As a result, details such as the hyperparameter tuning procedure, learning rate schedules, or even the optimizer are often omitted or chosen to fit the algorithm at hand. From some of our experiments, we observe that simply fine-tuning a learning rate schedule for any of the methods from Table 5.2 offers a bigger performance increase than choosing a better algorithm in the first place. As such, design choices can have a heavy impact on how well an algorithm performs. Having a common benchmarking procedure such as DOMAINBED, where these are fixed, is necessary to make substantial progress in this field. We hope that we can push adoption in the community with our addition of RSC and the methods proposed in this work. However, not using learning rate schedules and following the pre-defined distributions for learning rates, batch sizes, or weight decays might inherently bias this comparison and should not be neglected as a factor.

我们能够在DOMAINBED之外为DIVCAM-S实现最先进的结果,这反映了领域泛化研究中一个普遍存在的问题,即算法比较的一致性和可复现性。由于计算资源限制,新的方法通常仅与先前工作的结果进行比较。因此,诸如超参数调优过程、学习率调度,甚至优化器的细节常被省略或被选择以适应当前算法。从我们的一些实验中观察到,仅仅为表5.2中的任一方法微调学习率调度,带来的性能提升往往超过了选择更优算法本身。因此,设计选择对算法性能有重大影响。像DOMAINBED这样固定这些因素的统一基准测试程序,对于推动该领域的实质性进展是必要的。我们希望通过引入RSC及本文提出的方法,推动社区的采纳。然而,不使用学习率调度且遵循预定义的学习率、批量大小或权重衰减分布,可能固有地导致比较偏差,这一点不应被忽视。

On top of that, Table 5.2 also shows the results for D-TRANSFORMERS. Generally, we observe that many variants of the prototype-based approaches outlined in Section 3.2.3 fail to generalize well to ResNet-50, even though they exhibit promising performance on ResNet-18 (see Table 5.1 for ProDrop results on PACS with ResNet-18). This might be due to the fact that prototypical approaches commonly require higher-resolution feature maps, e.g. \( {14} \times {14} \) in [43] via dilated convolutions. Nevertheless, D-TRANSFORMERS already performs quite well in the benchmarking procedure outlined by DOMAINBED even without these additional changes. Keeping in mind that these adaptations could push performance even further makes the approach all the more suited for domain generalization, although it is unclear how much the other algorithms would benefit from such a change. The value of prototypical approaches does not only lie in good performance: any prototypical approach can also offer a significant amount of explainability, since it allows for visualizing the prototype similarity maps, the closest image patches to the prototypes, or even the prototypes themselves if a sufficient decoder has been jointly trained. In particular, D-TRANSFORMERS is able to achieve Top-2 performance in DOMAINBED for VLCS, but performance seems to be weaker on the TerraIncognita dataset. We believe that this is because the dataset commonly includes only parts of the different animals at a very high distance, making it hard for the network to extract meaningful prototypes in the first place with the small feature resolution available. As such, it should be the hardest dataset for

此外,表5.2还展示了D-TRANSFORMERS的结果。总体来看,我们观察到第3.2.3节中概述的多种基于原型的方法在ResNet-50上往往难以良好泛化,尽管它们在ResNet-18上表现出较好的性能(参见表5.1中PACS数据集上PRo-DROP的结果)。这可能是因为原型方法通常需要更高分辨率的特征图,例如文献[43]中通过空洞卷积实现的特征图。尽管如此,D-TRANSFORMERS即使未做这些额外调整,也已在DOMAINBED基准测试中表现良好。考虑到这些调整可以进一步提升性能,使该方法更适合领域泛化,尽管尚不清楚其他算法从中受益多少。原型方法的价值不仅在于良好的性能,还在于其显著的可解释性,因为它们允许可视化原型相似度图、与原型最接近的图像块,甚至在联合训练了足够解码器的情况下直接可视化原型。特别地,D-TRANSFORMERS能够在DOMAINBED的VLCS数据集上取得第二名的表现,但在TerraIncognita数据集上的表现似乎不佳。我们认为这是因为该数据集通常只包含距离较远的不同动物的部分,使得网络难以利用有限的特征分辨率提取有意义的原型。因此,它应当是最难的数据集之一。

Algorithm | Ref. | VLCS | PACS | Office-Home | Terra Inc. | DomainNet | Avg.
ERM | [187] | \( {77.5} \pm {0.4} \) | \( {85.5} \pm {0.2} \) | \( {66.5} \pm {0.3} \) | \( {46.1} \pm {1.8} \) | \( {40.9} \pm {0.1} \) | 63.3
IRM | [9] | \( {78.5} \pm {0.5} \) | \( {83.5} \pm {0.8} \) | \( {64.3} \pm {2.2} \) | \( {47.6} \pm {0.8} \) | \( {33.9} \pm {2.8} \) | 61.5
GroupDRO | [159] | \( {76.7} \pm {0.6} \) | \( {84.4} \pm {0.8} \) | \( {66.0} \pm {0.7} \) | \( {43.2} \pm {1.1} \) | \( {33.3} \pm {0.2} \) | 60.7
Mixup | [203] | \( {77.4} \pm {0.6} \) | \( {84.6} \pm {0.6} \) | \( {68.1} \pm {0.3} \) | \( {47.9} \pm {0.8} \) | \( {39.2} \pm {0.1} \) | 63.4
MLDG | [108] | \( {77.2} \pm {0.4} \) | \( {84.9} \pm {1.0} \) | \( {66.8} \pm {0.6} \) | \( {47.7} \pm {0.9} \) | \( {41.2} \pm {0.1} \) | 63.5
CORAL | [179] | \( {78.8} \pm {0.6} \) | \( {86.2} \pm {0.3} \) | \( {68.7} \pm {0.3} \) | \( {47.6} \pm {1.0} \) | \( {41.5} \pm {0.1} \) | 64.5
MMD | [112] | \( {77.5} \pm {0.9} \) | \( {84.6} \pm {0.5} \) | \( {66.3} \pm {0.1} \) | \( {42.2} \pm {1.6} \) | \( {23.4} \pm {9.5} \) | 58.8
DANN | [61] | \( {78.6} \pm {0.4} \) | \( {83.6} \pm {0.4} \) | \( {65.9} \pm {0.6} \) | \( {46.7} \pm {0.5} \) | \( {38.3} \pm {0.1} \) | 62.6
CDANN | [116] | \( {77.5} \pm {0.1} \) | \( {82.6} \pm {0.9} \) | \( {65.8} \pm {1.3} \) | \( {45.8} \pm {1.6} \) | \( {38.3} \pm {0.3} \) | 62.0
MTL | [19] | \( {77.2} \pm {0.4} \) | \( {84.6} \pm {0.5} \) | \( {66.4} \pm {0.5} \) | \( {45.6} \pm {1.2} \) | \( {40.6} \pm {0.1} \) | 62.8
SagNet | [135] | \( {77.8} \pm {0.5} \) | \( {86.3} \pm {0.2} \) | \( {68.1} \pm {0.1} \) | \( {48.6} \pm {1.0} \) | \( {40.3} \pm {0.1} \) | 64.2
ARM | [212] | \( {77.6} \pm {0.3} \) | \( {85.1} \pm {0.4} \) | \( {64.8} \pm {0.3} \) | \( {45.5} \pm {0.3} \) | \( {35.5} \pm {0.2} \) | 61.7
VREx | [101] | \( {78.3} \pm {0.2} \) | \( {84.9} \pm {0.6} \) | \( {66.4} \pm {0.6} \) | \( {46.4} \pm {0.6} \) | \( {33.6} \pm {2.9} \) | 61.9
RSC | [86] | \( {77.1} \pm {0.5} \) | \( {85.2} \pm {0.9} \) | \( {65.5} \pm {0.9} \) | \( {46.6} \pm {1.0} \) | \( {38.9} \pm {0.5} \) | 62.7
DivCAM-S | (ours) | \( {77.8} \pm {0.3} \) | \( {85.4} \pm {0.2} \) | \( {65.2} \pm {0.3} \) | \( {48.0} \pm {1.2} \) | \( {40.7} \pm {0.0} \) | 63.4
D-TRANSFORMERS | (ours) | \( {78.7} \pm {0.5} \) | \( {84.2} \pm {0.1} \) | - | \( {42.9} \pm {1.1} \) | - | -
ERM* | [187] | \( {77.6} \pm {0.3} \) | \( {86.7} \pm {0.3} \) | \( {66.4} \pm {0.5} \) | \( {53.0} \pm {0.3} \) | \( {41.3} \pm {0.1} \) | 65.0
IRM* | [9] | \( {76.9} \pm {0.6} \) | \( {84.5} \pm {1.1} \) | \( {63.0} \pm {2.7} \) | \( {50.5} \pm {0.7} \) | \( {28.0} \pm {5.1} \) | 60.5
GroupDRO* | [159] | \( {77.4} \pm {0.5} \) | \( {87.1} \pm {0.1} \) | \( {66.2} \pm {0.6} \) | \( {52.4} \pm {0.1} \) | \( {33.4} \pm {0.3} \) | 63.3
Mixup* | [203] | \( {78.1} \pm {0.3} \) | \( {86.8} \pm {0.3} \) | \( {68.0} \pm {0.2} \) | \( {54.4} \pm {0.3} \) | \( {39.6} \pm {0.1} \) | 65.3
MLDG* | [108] | \( {77.5} \pm {0.1} \) | \( {86.8} \pm {0.4} \) | \( {66.6} \pm {0.3} \) | \( {52.0} \pm {0.1} \) | \( {41.6} \pm {0.1} \) | 64.9
CORAL* | [179] | \( {77.7} \pm {0.2} \) | \( {87.1} \pm {0.5} \) | \( {68.4} \pm {0.2} \) | \( {52.8} \pm {0.2} \) | \( {41.8} \pm {0.1} \) | 65.5
MMD* | [112] | \( {77.9} \pm {0.1} \) | \( {87.2} \pm {0.1} \) | \( {66.2} \pm {0.3} \) | \( {52.0} \pm {0.4} \) | \( {23.5} \pm {9.4} \) | 61.3
DANN* | [61] | \( {79.7} \pm {0.5} \) | \( {85.2} \pm {0.2} \) | \( {65.3} \pm {0.8} \) | \( {50.6} \pm {0.4} \) | \( {38.3} \pm {0.1} \) | 63.8
CDANN* | [116] | \( {79.9} \pm {0.2} \) | \( {85.8} \pm {0.8} \) | \( {65.3} \pm {0.5} \) | \( {50.8} \pm {0.6} \) | \( {38.5} \pm {0.2} \) | 64.0
MTL* | [19] | \( {77.7} \pm {0.5} \) | \( {86.7} \pm {0.2} \) | \( {66.5} \pm {0.4} \) | \( {52.2} \pm {0.4} \) | \( {40.8} \pm {0.1} \) | 64.7
SagNet* | [135] | \( {77.6} \pm {0.1} \) | \( {86.4} \pm {0.4} \) | \( {67.5} \pm {0.2} \) | \( {52.5} \pm {0.4} \) | \( {40.8} \pm {0.2} \) | 64.9
ARM* | [212] | \( {77.8} \pm {0.3} \) | \( {85.8} \pm {0.2} \) | \( {64.8} \pm {0.4} \) | \( {51.2} \pm {0.5} \) | \( {36.0} \pm {0.2} \) | 63.1
VREx* | [101] | \( {78.1} \pm {0.2} \) | \( {87.2} \pm {0.6} \) | \( {65.7} \pm {0.3} \) | \( {51.4} \pm {0.5} \) | \( {30.1} \pm {3.7} \) | 62.5
RSC* | [86] | \( {77.8} \pm {0.6} \) | \( {86.2} \pm {0.5} \) | \( {66.5} \pm {0.6} \) | \( {52.1} \pm {0.2} \) | \( {38.9} \pm {0.6} \) | 64.3
DivCAM-S* | (ours) | \( {78.1} \pm {0.6} \) | \( {87.2} \pm {0.1} \) | \( {65.2} \pm {0.5} \) | \( {51.3} \pm {0.5} \) | \( {41.0} \pm {0.0} \) | 64.6
D-TRANSFORMERS* | (ours) | \( {77.7} \pm {0.1} \) | \( {86.9} \pm {0.3} \) | - | \( {52.4} \pm {0.8} \) | - | -
算法\( \mathbf{{Ref}.} \)VLCSPACSOffice-HomeTerra Inc.DomainNet\( \mathbf{{Avg}.} \)
ERM(经验风险最小化)[187]\( {77.5} \pm {0.4} \)\( {85.5} \pm {0.2} \)\( {66.5} \pm {0.3} \)\( {46.1} \pm {1.8} \)\( {40.9} \pm {0.1} \)63.3
IRM(不变风险最小化)[9]\( {78.5} \pm {0.5} \)\( {83.5} \pm {0.8} \)\( {64.3} \pm {2.2} \)\( {47.6} \pm {0.8} \)\( {33.9} \pm {2.8} \)61.5
GroupDRO(群组分布鲁棒优化)[159]\( {76.7} \pm {0.6} \)\( {84.4} \pm {0.8} \)\( {66.0} \pm {0.7} \)\( {43.2} \pm {1.1} \)\( {33.3} \pm {0.2} \)60.7
Mixup(混合增强)[203]\( {77.4} \pm {0.6} \)\( {84.6} \pm {0.6} \)\( {68.1} \pm {0.3} \)\( {47.9} \pm {0.8} \)\( {39.2} \pm {0.1} \)63.4
MLDG(元学习领域泛化)[108]\( {77.2} \pm {0.4} \)\( {84.9} \pm {1.0} \)\( {66.8} \pm {0.6} \)\( {47.7} \pm {0.9} \)\( {41.2} \pm {0.1} \)63.5
CORAL(相关对齐)[179]\( {78.8} \pm {0.6} \)\( {86.2} \pm {0.3} \)\( {68.7} \pm {0.3} \)\( {47.6} \pm {1.0} \)\( {41.5} \pm {0.1} \)64.5
MMD(最大均值差异)[112]\( {77.5} \pm {0.9} \)\( {84.6} \pm {0.5} \)\( {66.3} \pm {0.1} \)\( {42.2} \pm {1.6} \)\( {23.4} \pm {9.5} \)58.8
DANN(域对抗神经网络)[61]\( {78.6} \pm {0.4} \)\( {83.6} \pm {0.4} \)\( {65.9} \pm {0.6} \)\( {46.7} \pm {0.5} \)\( {38.3} \pm {0.1} \)62.6
CDANN (条件域对抗神经网络) [116] | 77.5 ± 0.1 | 82.6 ± 0.9 | 65.8 ± 1.3 | 45.8 ± 1.6 | 38.3 ± 0.3 | 62.0
MTL (多任务学习) [19] | 77.2 ± 0.4 | 84.6 ± 0.5 | 66.4 ± 0.5 | 45.6 ± 1.2 | 40.6 ± 0.1 | 62.8
SagNet [135] | 77.8 ± 0.5 | 86.3 ± 0.2 | 68.1 ± 0.1 | 48.6 ± 1.0 | 40.3 ± 0.1 | 64.2
ARM [212] | 77.6 ± 0.3 | 85.1 ± 0.4 | 64.8 ± 0.3 | 45.5 ± 0.3 | 35.5 ± 0.2 | 61.7
VREx [101] | 78.3 ± 0.2 | 84.9 ± 0.6 | 66.4 ± 0.6 | 46.4 ± 0.6 | 33.6 ± 2.9 | 61.9
RSC [86] | 77.1 ± 0.5 | 85.2 ± 0.9 | 65.5 ± 0.9 | 46.6 ± 1.0 | 38.9 ± 0.5 | 62.7
DivCAM-S (本方法) | 77.8 ± 0.3 | 85.4 ± 0.2 | 65.2 ± 0.3 | 48.0 ± 1.2 | 40.7 ± 0.0 | 63.4
D-TRANSFORMERS (本方法) | 78.7 ± 0.5 | 84.2 ± 0.1 | - | 42.9 ± 1.1 | - | -
ERM* [187] | 77.6 ± 0.3 | 86.7 ± 0.3 | 66.4 ± 0.5 | 53.0 ± 0.3 | 41.3 ± 0.1 | 65.0
IRM* [9] | 76.9 ± 0.6 | 84.5 ± 1.1 | 63.0 ± 2.7 | 50.5 ± 0.7 | 28.0 ± 5.1 | 60.5
GroupDRO* [159] | 77.4 ± 0.5 | 87.1 ± 0.1 | 66.2 ± 0.6 | 52.4 ± 0.1 | 33.4 ± 0.3 | 63.3
Mixup* [203] | 78.1 ± 0.3 | 86.8 ± 0.3 | 68.0 ± 0.2 | 54.4 ± 0.3 | 39.6 ± 0.1 | 65.3
MLDG* [108] | 77.5 ± 0.1 | 86.8 ± 0.4 | 66.6 ± 0.3 | 52.0 ± 0.1 | 41.6 ± 0.1 | 64.9
CORAL* [179] | 77.7 ± 0.2 | 87.1 ± 0.5 | 68.4 ± 0.2 | 52.8 ± 0.2 | 41.8 ± 0.1 | 65.5
MMD* [112] | 77.9 ± 0.1 | 87.2 ± 0.1 | 66.2 ± 0.3 | 52.0 ± 0.4 | 23.5 ± 9.4 | 61.3
DANN* [61] | 79.7 ± 0.5 | 85.2 ± 0.2 | 65.3 ± 0.8 | 50.6 ± 0.4 | 38.3 ± 0.1 | 63.8
CDANN* [116] | 79.9 ± 0.2 | 85.8 ± 0.8 | 65.3 ± 0.5 | 50.8 ± 0.6 | 38.5 ± 0.2 | 64.0
MTL* [19] | 77.7 ± 0.5 | 86.7 ± 0.2 | 66.5 ± 0.4 | 52.2 ± 0.4 | 40.8 ± 0.1 | 64.7
SagNet* [135] | 77.6 ± 0.1 | 86.4 ± 0.4 | 67.5 ± 0.2 | 52.5 ± 0.4 | 40.8 ± 0.2 | 64.9
ARM* [212] | 77.8 ± 0.3 | 85.8 ± 0.2 | 64.8 ± 0.4 | 51.2 ± 0.5 | 36.0 ± 0.2 | 63.1
VREx* [101] | 78.1 ± 0.2 | 87.2 ± 0.6 | 65.7 ± 0.3 | 51.4 ± 0.5 | 30.1 ± 3.7 | 62.5
RSC* [86] | 77.8 ± 0.6 | 86.2 ± 0.5 | 66.5 ± 0.6 | 52.1 ± 0.2 | 38.9 ± 0.6 | 64.3
DivCAM-S* (本方法) | 78.1 ± 0.6 | 87.2 ± 0.1 | 65.2 ± 0.5 | 51.3 ± 0.5 | 41.0 ± 0.0 | 64.6
D-TRANSFORMERS* (本方法) | 77.7 ± 0.1 | 86.9 ± 0.3 | - | 52.4 ± 0.8 | - | -

Table 5.2: Performance comparison across datasets using training-domain validation (top) and oracle validation, denoted with * (bottom). Columns report accuracy on VLCS, PACS, OfficeHome, TerraIncognita, and DomainNet, followed by the average. We use a ResNet-50 backbone, optimize with ADAM, and follow the distributions specified in DOMAINBED. Only RSC and our methods were added as part of this work; the other baselines are taken from DOMAINBED.

表5.2:使用训练域验证(上方)和用*标记的oracle验证(下方)在各数据集上的性能比较。我们采用ResNet-50主干网络,使用ADAM优化器,并遵循DOMAINBED中指定的分布。仅RSC和我们的方法作为本工作新增,其他基线方法均取自DOMAINBED。

D-TRANSFORMERS in DOMAINBED. In Table 5.2, the results for D-TRANSFORMERS on OfficeHome and DomainNet are omitted due to computational constraints, but we would expect its performance to be on par with, if not better than, the other methods, as these datasets do not share the prototype extraction difficulties of TerraIncognita.

DOMAINBED中的D-TRANSFORMERS。由于计算资源限制,表5.2中省略了D-TRANSFORMERS在OfficeHome和DomainNet上的结果,但我们预计其性能至少与其他方法持平,甚至更优,因为这些数据集不存在TerraIncognita中原型提取的困难。

5.4 Ablation Studies

5.4 消融研究

In this section, we examine different ablations of the presented methods and how the individual components impact performance. In particular, for DivCAM-S, we analyze different methods of resetting the masks (mask batching) in Section 5.4.2 and assess how additional methods intended to improve the underlying class activation maps affect performance in Section 5.4.3. On top of that, for ProDrop, we evaluate the effect of self-challenging for different negative weights in Section 5.4.4, as well as the impact of the additional intra-loss factor in Section 5.4.5.

本节将探讨所提出方法的不同消融实验及各组成部分对性能的影响。具体来说,对于DIVCAM-S,我们在5.4.2节分析了不同的掩码重置(掩码批处理)方法,在5.4.3节考察了旨在提升基础类别激活图的额外方法对性能的影响。此外,对于ProDrop,我们在5.4.4节评估了不同负权重下自我挑战机制的效果,在5.4.5节分析了额外的类内损失因子的影响。

5.4.1 Hyperparameter Distributions & Schedules

5.4.1 超参数分布与调度

For the mask batching ablation study, we use ADAM [98] and the distributions from Table 5.4. When the batch drop factor is scheduled, we use an increasing linear schedule, while the learning rate always follows a step decay that multiplies the learning rate by 0.1 at epoch 80 of 100.

在掩码批处理消融实验中,我们使用ADAM [98]优化器和表5.4中的分布。当批量丢弃因子采用调度时,使用线性递增调度,而学习率始终采用阶梯衰减调度,在第80个epoch(共100个)时将学习率乘以0.1。
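A minimal sketch of these two schedules, assuming 100 training epochs (the function names are illustrative, not from the thesis):

```python
def step_decay_lr(base_lr, epoch, decay_epoch=80, factor=0.1):
    """Step decay: multiply the learning rate by `factor` from the
    decay epoch onwards (epoch 80 of 100 in the text)."""
    return base_lr * factor if epoch >= decay_epoch else base_lr

def linear_batch_drop(b_final, epoch, total_epochs=100):
    """Increasing linear schedule for the batch drop factor,
    ramping from 0 at the start of training to b_final at the end."""
    return b_final * epoch / total_epochs
```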

| Algorithm | Ref. | Backbone | P | A | C | S | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| BASELINE | [28] | ResNet-18 | 95.73 | 77.85 | 74.86 | 67.74 | 79.05 |
| MASF | [46] | ResNet-18 | 94.99 | 80.29 | 77.17 | 71.69 | 81.03 |
| EPI-FCR | [110] | ResNet-18 | 93.90 | 82.10 | 77.00 | 73.00 | 81.50 |
| JIGEN | [28] | ResNet-18 | 96.03 | 79.42 | 75.25 | 71.35 | 80.51 |
| MetaReg | [15] | ResNet-18 | 95.50 | 83.70 | 77.20 | 70.30 | 81.70 |
| RSC (reported) | [86] | ResNet-18 | 95.99 | 83.43 | 80.31 | 80.85 | 85.15 |
| RSC (reproduced) | [86] | ResNet-18 | 93.73 | 80.41 | 77.53 | 80.79 | 83.12 |
| DivCAM-S (ours) | - | ResNet-18 | 96.11 | 80.27 | 77.82 | 82.18 | 84.10 |
| 算法 | 参考 | 主干网络 | P | A | C | S | 平均 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 基线 | [28] | ResNet-18 | 95.73 | 77.85 | 74.86 | 67.74 | 79.05 |
| MASF | [46] | ResNet-18 | 94.99 | 80.29 | 77.17 | 71.69 | 81.03 |
| EPI-FCR | [110] | ResNet-18 | 93.90 | 82.10 | 77.00 | 73.00 | 81.50 |
| JIGEN | [28] | ResNet-18 | 96.03 | 79.42 | 75.25 | 71.35 | 80.51 |
| MetaReg | [15] | ResNet-18 | 95.50 | 83.70 | 77.20 | 70.30 | 81.70 |
| RSC(报告值) | [86] | ResNet-18 | 95.99 | 83.43 | 80.31 | 80.85 | 85.15 |
| RSC(复现值) | [86] | ResNet-18 | 93.73 | 80.41 | 77.53 | 80.79 | 83.12 |
| DivCAM-S(本方法) | - | ResNet-18 | 96.11 | 80.27 | 77.82 | 82.18 | 84.10 |

Table 5.3: Performance comparison for PACS outside of the DOMAINBED framework with the official data split.

表5.3:在官方数据划分下,DOMAINBED框架外PACS的性能比较。

| | Hyperparameter | Distribution |
| --- | --- | --- |
| \( \alpha \) | learning rate | \( \mathcal{LU}_{10}(-5, -1) \) |
| \( \mathcal{B} \) | batch size | \( \lfloor \mathcal{LU}_{2}(3, 9) \rfloor \) |
| \( \gamma \) | weight decay | \( \mathcal{LU}_{10}(-6, -2) \) |
| \( p \) | feature drop factor | \( 1/3 \) |
| \( b \) | batch drop factor | \( \mathcal{U}(0, 1) \) |

| | 超参数 | 分布 |
| --- | --- | --- |
| \( \alpha \) | 学习率 | \( \mathcal{LU}_{10}(-5, -1) \) |
| \( \mathcal{B} \) | 批量大小 | \( \lfloor \mathcal{LU}_{2}(3, 9) \rfloor \) |
| \( \gamma \) | 权重衰减 | \( \mathcal{LU}_{10}(-6, -2) \) |
| \( p \) | 特征丢弃因子 | \( 1/3 \) |
| \( b \) | 批量丢弃因子 | \( \mathcal{U}(0, 1) \) |

Table 5.4: Hyperparameters and distributions used in the random search for the mask batching ablation study. \( \mathcal{LU}_{x}(a, b) \) denotes a log-uniform distribution between \( a \) and \( b \) with base \( x \), the uniform distribution is denoted as \( \mathcal{U}(a, b) \), and \( \lfloor \cdot \rfloor \) is the floor operator.

表5.4:用于掩码批处理消融研究的随机搜索超参数及其分布。\( \mathcal{LU}_{x}(a, b) \)表示以\( x \)为底、介于\( a \)与\( b \)之间的对数均匀分布,均匀分布表示为\( \mathcal{U}(a, b) \),\( \lfloor \cdot \rfloor \)为向下取整运算符。
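Drawing one configuration from the Table 5.4 search space can be sketched as follows (a minimal illustration with the Python standard library; the helper names are ours, not from the thesis):

```python
import math
import random

def log_uniform(base, lo, hi):
    """Draw from LU_base(lo, hi), i.e. base raised to a Uniform(lo, hi) exponent."""
    return base ** random.uniform(lo, hi)

def sample_config():
    """One random-search sample following the Table 5.4 distributions."""
    return {
        "learning_rate": log_uniform(10, -5, -1),        # LU_10(-5, -1)
        "batch_size": math.floor(log_uniform(2, 3, 9)),  # floor(LU_2(3, 9))
        "weight_decay": log_uniform(10, -6, -2),         # LU_10(-6, -2)
        "feature_drop_factor": 1 / 3,                    # fixed at 1/3
        "batch_drop_factor": random.uniform(0, 1),       # U(0, 1)
    }
```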

For the mask ablation study, we use ADAM [98] and the distributions from Table 5.5. When the batch drop factor is scheduled, we use an increasing linear schedule, while the learning rate is not scheduled. This corresponds to the tuning distributions provided in DOMAINBED, which are also used for all the main results and all other ablations. Unless marked otherwise, each experiment evaluates 20 hyperparameter samples, as suggested in DOMAINBED.

在掩码消融研究中,我们使用ADAM [98]优化器和表5.5中的分布。当批次丢弃因子被调度时,我们采用线性递增调度,而学习率不进行调度。这对应于DOMAINBED中提供的调优分布,该分布也用于所有主要结果和其他所有消融实验。除非另有说明,每个实验评估20个超参数样本,类似于DOMAINBED的建议。

5.4.2 DivCAM: Mask Batching

5.4.2 DivCAM:掩码批处理

There are several ways to compute the vector \( c \) in our method and hence to determine how the masks are applied within each batch. Here, we analyze the effect of a few possible choices on performance. By default, DivCAM uses Equation (5.1), where \( y_{gt} \) is the confidence on the ground-truth class after softmax. This applies the masks to the samples with the highest confidence on the correct class within each batch.

在我们的方法中,存在多种计算向量\( c \)的方法,从而决定如何在每个批次内应用掩码。这里,我们分析几种可能选择对性能的影响。默认情况下,DivCAM使用公式(5.1),其中\( y_{gt} \)是经过softmax后对真实类别的置信度。该方法在每个批次中对正确类别置信度最高的样本应用掩码。

\( c_{n} = y_{gt} \) (5.1)

Another option is DivCAM-C with Equation (5.2), which computes the change in confidence on the ground-truth class when the mask is applied. The masked confidence after softmax is denoted as \( \tilde{y}_{gt} \). This variation applies the masks to the samples for which the mask decreases the confidence on the ground-truth class the most.

另一种选择是DivCAM-C,使用公式(5.2)计算应用掩码时对真实类别置信度的变化。掩码后经过softmax的置信度表示为\( \tilde{y}_{gt} \)。该变体对掩码使真实类别置信度下降最多的样本应用掩码。

\( c_{n} = y_{gt} - \tilde{y}_{gt} \) (5.2)

The last variation is DivCAM-T, where we apply the masks randomly to samples that are correctly classified. All variants can further be extended by adding a linear schedule, denoted with an additional "S", or by computing \( c \) for each domain separately, denoted with an additional "D". By adding a schedule, we apply masks more in the later training epochs, where discriminative features have already been learned, and by enforcing the selection per domain we ensure that the masks are not applied in a way that is biased towards a subset of domains while disregarding others. Table 5.6 shows experiments for these variants.

最后一种变体是DIVCAM-T,我们对正确分类的样本随机应用掩码。所有变体还可以通过添加线性调度(用额外的"S"表示)或对每个域单独计算c(用额外的"D"表示)进行扩展。通过添加调度,我们在训练后期更多地应用掩码,此时判别特征已被学习;通过对每个域强制执行,可以确保掩码不会偏向某些域而忽视其他域。表5.6展示了这些变体的实验结果。
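The three scoring rules above can be sketched in pure Python as follows (a minimal illustration; the function and variant names are ours, and the batch is represented as plain lists of logits):

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def batch_scores(logits, masked_logits, labels, variant="divcam"):
    """Per-sample score vector c deciding where masks are applied
    within a batch (higher score -> mask applied to that sample)."""
    scores = []
    for z, z_m, y in zip(logits, masked_logits, labels):
        p, p_m = softmax(z), softmax(z_m)
        if variant == "divcam":        # Eq. (5.1): c_n = y_gt
            scores.append(p[y])
        elif variant == "divcam_c":    # Eq. (5.2): c_n = y_gt - masked y_gt
            scores.append(p[y] - p_m[y])
        elif variant == "divcam_t":    # random score among correct samples
            correct = max(range(len(z)), key=z.__getitem__) == y
            scores.append(random.random() if correct else 0.0)
    return scores
```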

| | Hyperparameter | Distribution |
| --- | --- | --- |
| \( \alpha \) | learning rate | \( \mathcal{LU}_{10}(-5, -3.5) \) |
| \( \mathcal{B} \) | batch size | \( \lfloor \mathcal{LU}_{2}(3, 5.5) \rfloor \) |
| \( \gamma \) | weight decay | \( \mathcal{LU}_{10}(-6, -2) \) |
| \( p \) | feature drop factor | \( \mathcal{U}(0.2, 0.5) \) |
| \( b \) | batch drop factor | \( \mathcal{U}(0, 1) \) |
| \( \lambda_{1} \) | hnc factor | \( \mathcal{LU}_{10}(-3, -1) \) |
| \( k \) | negative classes | num_classes − 1 |
| \( \lambda_{tap} \) | tap factor | \( \mathcal{U}(0, 1) \) |
| \( \lambda_{2} \) | adversarial factor | \( \mathcal{LU}_{10}(-2, 2) \) |
| \( S \) | disc per gen step | \( \lfloor \mathcal{LU}_{2}(0, 3) \rfloor \) |
| \( \eta \) | gradient penalty | \( \mathcal{LU}_{10}(-2, 1) \) |
| \( \omega_{s} \) | mlp width | 512 |
| \( \omega_{d} \) | mlp depth | 3 |
| \( \omega_{dr} \) | mlp dropout | 0.5 |
| \( \lambda_{3} \) | mmd factor | \( \mathcal{LU}_{10}(-1, 1) \) |

| | 超参数 | 分布 |
| --- | --- | --- |
| \( \alpha \) | 学习率 | \( \mathcal{LU}_{10}(-5, -3.5) \) |
| \( \mathcal{B} \) | 批量大小 | \( \lfloor \mathcal{LU}_{2}(3, 5.5) \rfloor \) |
| \( \gamma \) | 权重衰减 | \( \mathcal{LU}_{10}(-6, -2) \) |
| \( p \) | 特征丢弃因子 | \( \mathcal{U}(0.2, 0.5) \) |
| \( b \) | 批量丢弃因子 | \( \mathcal{U}(0, 1) \) |
| \( \lambda_{1} \) | hnc因子 | \( \mathcal{LU}_{10}(-3, -1) \) |
| \( k \) | 负类 | 类别数 − 1 |
| \( \lambda_{tap} \) | tap因子 | \( \mathcal{U}(0, 1) \) |
| \( \lambda_{2} \) | 对抗因子 | \( \mathcal{LU}_{10}(-2, 2) \) |
| \( S \) | 每生成步骤判别器次数 | \( \lfloor \mathcal{LU}_{2}(0, 3) \rfloor \) |
| \( \eta \) | 梯度惩罚 | \( \mathcal{LU}_{10}(-2, 1) \) |
| \( \omega_{s} \) | 多层感知机宽度 | 512 |
| \( \omega_{d} \) | 多层感知机深度 | 3 |
| \( \omega_{dr} \) | 多层感知机丢弃率 | 0.5 |
| \( \lambda_{3} \) | 最大均值差异因子 | \( \mathcal{LU}_{10}(-1, 1) \) |

Table 5.5: Hyperparameters and distributions used in the random search for the mask ablation study. \( \mathcal{LU}_{x}(a, b) \) denotes a log-uniform distribution between \( a \) and \( b \) with base \( x \), the uniform distribution is denoted as \( \mathcal{U}(a, b) \), and \( \lfloor \cdot \rfloor \) is the floor operator.

表5.5:用于掩码消融研究的随机搜索超参数及其分布。\( \mathcal{LU}_{x}(a, b) \)表示以\( x \)为底、介于\( a \)与\( b \)之间的对数均匀分布,均匀分布表示为\( \mathcal{U}(a, b) \),\( \lfloor \cdot \rfloor \)为向下取整运算符。

We observe that adding a schedule helps in most cases, achieving the highest training-domain validation performance for DivCAM-S. Enforcing the application of masks within each domain, however, does not consistently improve performance, and we therefore do not consider it for the final method.

我们观察到,添加调度在大多数情况下都有帮助,DIvCAM-S在训练域验证上达到了最高性能。然而,强制在每个域内应用掩码并未持续提升性能,因此我们未将其纳入最终方法。

5.4.3 DivCAM: Class Activation Maps

5.4.3 DivCAM:类激活图

We combine our class activation maps with other methods from domain generalization, as well as with methods from the weakly-supervised object localization literature that are designed to boost the explainability of class activation maps. The results are shown in Table 5.7, where CAM + CDANN drops the self-challenging part from DivCAM and just computes ordinary cross-entropy while aligning the class activation maps.

我们将类激活图与领域泛化的其他方法相结合,并结合来自弱监督目标定位文献中旨在提升类激活图可解释性的方法。结果如表5.7所示,其中CAM + CDANN去除了DivCAM中的自我挑战部分,仅计算普通交叉熵,同时对齐类激活图。

Surprisingly, we observe that none of these additions has a positive effect under training-domain validation, even though some of them exhibit better performance under oracle validation. Notably, the CDANN approach in particular tends to exhibit a high standard deviation for some of the domains (e.g., art), which suggests that this approach can be made to look stronger when performance is reported for only a single seed. However, since we are looking for a method that reliably provides competitive results, regardless of the seed and without extensive fine-tuning, we disregard this option.

令人惊讶的是,我们观察到这些方法对训练域验证均无正面影响,尽管部分方法在oracle验证上表现更佳。值得注意的是,尤其是CDANN方法在某些域(如艺术域)表现出较高的标准差,表明该方法在仅报告单一随机种子性能时可能经过微调。然而,由于我们寻求的是一种无论随机种子如何且无需大量微调都能稳定提供竞争性结果的方法,因此我们放弃了该选项。

5.4.4 ProDrop: Self-Challenging

5.4.4 ProDrop:自我挑战

Table 5.8 shows the ablation results for the self-challenging addition. We observe that for most negative weights, adding self-challenging results in a performance increase. Most notably, this occurs in cases where the performance without self-challenging is very poor, such as \( w_{c,j} = -0.2 \), \( w_{c,j} = -1.0 \), or \( w_{c,j} = 0.0 \) for all \( j : p_{j} \notin P_{c} \). Generally, in cases where it does not lead to a performance increase, the downside appears to be very small: we drop by at most 0.5% for \( w_{c,j} = -0.5 \;\forall j : p_{j} \notin P_{c} \).

表5.8展示了加入自我挑战的消融结果。我们观察到,对于大多数负权重,加入自我挑战会带来性能提升。尤其是在无自我挑战时性能极差的情况,如对所有\( j : p_{j} \notin P_{c} \)取\( w_{c,j} = -0.2 \)、\( w_{c,j} = -1.0 \)或\( w_{c,j} = 0.0 \)。总体而言,在未带来性能提升的情况下,性能下降幅度很小,最多在\( w_{c,j} = -0.5 \)时下降0.5%。
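The role of the negative weight can be illustrated with the following sketch. This is not the exact ProDrop formulation; it only shows, under our own simplifying assumptions, how a class score might aggregate similarities to its own prototypes with weight +1 and to other classes' prototypes with the negative weight \( w_{c,j} \) ablated in Table 5.8:

```python
def class_score(similarities, proto_classes, target_class, neg_weight=-1.0):
    """Illustrative sketch (not the exact ProDrop formulation): sum
    similarities to the target class's own prototypes with weight +1,
    and similarities to all other classes' prototypes with the negative
    weight (e.g. -1.0, -0.5, or 0.0 as in Table 5.8)."""
    score = 0.0
    for sim, cls in zip(similarities, proto_classes):
        score += sim if cls == target_class else neg_weight * sim
    return score
```

With `neg_weight = 0.0`, other classes' prototypes are ignored entirely, while more negative values penalize similarity to foreign prototypes more strongly.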

If we look at the performance changes solely as a function of the different negative weights, we cannot observe any consistent trends. In fact, it is very surprising that a small change such as going from \( w_{c,j} = -0.1 \) to \( w_{c,j} = -0.2 \) without self-challenging can lead to a 2% performance change.

如果仅根据不同负权重观察性能变化,我们无法发现任何一致的趋势。事实上,令人惊讶的是,诸如从\( w_{c,j} = -0.1 \)到\( w_{c,j} = -0.2 \)这样的小变化,在无自我挑战时竟能导致2%的性能变化。

| Name | P | A | C | S | Avg. |
| --- | --- | --- | --- | --- | --- |
| DivCAM | 94.0 ± 0.4 | 80.6 ± 1.2 | 75.4 ± 0.7 | 76.7 ± 0.7 | 81.7 ± 0.6 |
| DivCAM-S | 94.4 ± 0.7 | 80.5 ± 0.4 | 74.6 ± 2.2 | 79.0 ± 0.9 | 82.1 ± 0.3 |
| DivCAM-D | 94.3 ± 0.1 | 80.1 ± 0.1 | 74.5 ± 0.9 | 76.6 ± 1.7 | 81.4 ± 0.2 |
| DivCAM-DS | 93.9 ± 0.2 | 80.4 ± 0.4 | 73.4 ± 2.2 | 74.8 ± 1.2 | 80.6 ± 0.9 |
| DivCAM-C | 92.6 ± 0.4 | 80.1 ± 1.1 | 73.6 ± 1.4 | 75.0 ± 1.2 | 80.3 ± 0.9 |
| DivCAM-CS | 95.0 ± 0.6 | 79.9 ± 1.0 | 74.5 ± 0.7 | 78.1 ± 0.8 | 81.9 ± 0.4 |
| DivCAM-DC | 95.1 ± 0.4 | 79.5 ± 1.0 | 73.7 ± 0.9 | 75.2 ± 1.2 | 80.9 ± 0.4 |
| DivCAM-DCS | 93.5 ± 0.1 | 80.1 ± 0.2 | 75.1 ± 0.1 | 77.2 ± 1.6 | 81.5 ± 0.5 |
| DivCAM-T | 95.0 ± 0.3 | 80.3 ± 0.3 | 74.8 ± 0.8 | 75.3 ± 1.1 | 81.4 ± 0.4 |
| DivCAM-TS | 95.0 ± 0.1 | 79.9 ± 0.8 | 72.6 ± 1.3 | 77.1 ± 1.4 | 81.2 ± 0.4 |
| DivCAM-DT | 94.8 ± 0.6 | 79.6 ± 0.6 | 74.0 ± 1.1 | 78.5 ± 0.4 | 81.7 ± 0.1 |
| DivCAM-DTS | 95.1 ± 0.2 | 81.5 ± 1.3 | 75.5 ± 0.4 | 74.9 ± 2.0 | 81.7 ± 0.5 |
| DivCAM* | 94.9 ± 0.7 | 81.5 ± 0.7 | 76.6 ± 0.4 | 80.5 ± 0.7 | 83.4 ± 0.3 |
| DivCAM-S* | 94.9 ± 0.3 | 82.7 ± 0.7 | 76.3 ± 0.7 | 80.1 ± 0.4 | 83.5 ± 0.3 |
| DivCAM-D* | 94.8 ± 0.2 | 81.0 ± 0.7 | **77.6 ± 0.6** | 79.9 ± 0.6 | 83.3 ± 0.3 |
| DivCAM-DS* | 94.6 ± 0.5 | 80.7 ± 0.3 | 77.0 ± 0.4 | 79.3 ± 0.3 | 82.9 ± 0.1 |
| DivCAM-C* | 94.7 ± 0.5 | 82.6 ± 0.6 | 77.0 ± 0.5 | 80.1 ± 1.0 | **83.6 ± 0.3** |
| DivCAM-CS* | 94.2 ± 0.2 | 82.5 ± 0.8 | 76.9 ± 0.3 | 79.9 ± 0.7 | 83.4 ± 0.3 |
| DivCAM-DC* | 94.8 ± 0.4 | 82.0 ± 0.4 | 76.6 ± 0.9 | 80.1 ± 0.4 | 83.4 ± 0.1 |
| DivCAM-DCS* | 94.7 ± 0.4 | 81.0 ± 0.3 | 77.6 ± 0.2 | 80.3 ± 1.3 | 83.4 ± 0.3 |
| DivCAM-T* | 94.5 ± 0.4 | 81.6 ± 0.8 | 76.7 ± 0.2 | 79.6 ± 0.4 | 83.1 ± 0.4 |
| DivCAM-TS* | 94.8 ± 0.3 | 81.3 ± 0.2 | 76.7 ± 0.5 | 79.7 ± 0.5 | 83.2 ± 0.2 |
| DivCAM-DT* | 94.7 ± 0.5 | 80.9 ± 1.1 | 77.3 ± 0.5 | 79.9 ± 0.6 | 83.2 ± 0.2 |
| DivCAM-DTS* | 94.7 ± 0.5 | 82.1 ± 1.0 | 76.4 ± 0.6 | 79.5 ± 1.2 | 83.2 ± 0.1 |
| 名称 | P | A | C | S | 平均 |
| --- | --- | --- | --- | --- | --- |
| DivCAM | 94.0 ± 0.4 | 80.6 ± 1.2 | 75.4 ± 0.7 | 76.7 ± 0.7 | 81.7 ± 0.6 |
| DivCAM-S | 94.4 ± 0.7 | 80.5 ± 0.4 | 74.6 ± 2.2 | 79.0 ± 0.9 | 82.1 ± 0.3 |
| DivCAM-D | 94.3 ± 0.1 | 80.1 ± 0.1 | 74.5 ± 0.9 | 76.6 ± 1.7 | 81.4 ± 0.2 |
| DivCAM-DS | 93.9 ± 0.2 | 80.4 ± 0.4 | 73.4 ± 2.2 | 74.8 ± 1.2 | 80.6 ± 0.9 |
| DivCAM-C | 92.6 ± 0.4 | 80.1 ± 1.1 | 73.6 ± 1.4 | 75.0 ± 1.2 | 80.3 ± 0.9 |
| DivCAM-CS | 95.0 ± 0.6 | 79.9 ± 1.0 | 74.5 ± 0.7 | 78.1 ± 0.8 | 81.9 ± 0.4 |
| DivCAM-DC | 95.1 ± 0.4 | 79.5 ± 1.0 | 73.7 ± 0.9 | 75.2 ± 1.2 | 80.9 ± 0.4 |
| DivCAM-DCS | 93.5 ± 0.1 | 80.1 ± 0.2 | 75.1 ± 0.1 | 77.2 ± 1.6 | 81.5 ± 0.5 |
| DivCAM-T | 95.0 ± 0.3 | 80.3 ± 0.3 | 74.8 ± 0.8 | 75.3 ± 1.1 | 81.4 ± 0.4 |
| DivCAM-TS | 95.0 ± 0.1 | 79.9 ± 0.8 | 72.6 ± 1.3 | 77.1 ± 1.4 | 81.2 ± 0.4 |
| DivCAM-DT | 94.8 ± 0.6 | 79.6 ± 0.6 | 74.0 ± 1.1 | 78.5 ± 0.4 | 81.7 ± 0.1 |
| DivCAM-DTS | 95.1 ± 0.2 | 81.5 ± 1.3 | 75.5 ± 0.4 | 74.9 ± 2.0 | 81.7 ± 0.5 |
| DivCAM* | 94.9 ± 0.7 | 81.5 ± 0.7 | 76.6 ± 0.4 | 80.5 ± 0.7 | 83.4 ± 0.3 |
| DivCAM-S* | 94.9 ± 0.3 | 82.7 ± 0.7 | 76.3 ± 0.7 | 80.1 ± 0.4 | 83.5 ± 0.3 |
| DivCAM-D* | 94.8 ± 0.2 | 81.0 ± 0.7 | **77.6 ± 0.6** | 79.9 ± 0.6 | 83.3 ± 0.3 |
| DivCAM-DS* | 94.6 ± 0.5 | 80.7 ± 0.3 | 77.0 ± 0.4 | 79.3 ± 0.3 | 82.9 ± 0.1 |
| DivCAM-C* | 94.7 ± 0.5 | 82.6 ± 0.6 | 77.0 ± 0.5 | 80.1 ± 1.0 | **83.6 ± 0.3** |
| DivCAM-CS* | 94.2 ± 0.2 | 82.5 ± 0.8 | 76.9 ± 0.3 | 79.9 ± 0.7 | 83.4 ± 0.3 |
| DivCAM-DC* | 94.8 ± 0.4 | 82.0 ± 0.4 | 76.6 ± 0.9 | 80.1 ± 0.4 | 83.4 ± 0.1 |
| DivCAM-DCS* | 94.7 ± 0.4 | 81.0 ± 0.3 | 77.6 ± 0.2 | 80.3 ± 1.3 | 83.4 ± 0.3 |
| DivCAM-T* | 94.5 ± 0.4 | 81.6 ± 0.8 | 76.7 ± 0.2 | 79.6 ± 0.4 | 83.1 ± 0.4 |
| DivCAM-TS* | 94.8 ± 0.3 | 81.3 ± 0.2 | 76.7 ± 0.5 | 79.7 ± 0.5 | 83.2 ± 0.2 |
| DivCAM-DT* | 94.7 ± 0.5 | 80.9 ± 1.1 | 77.3 ± 0.5 | 79.9 ± 0.6 | 83.2 ± 0.2 |
| DivCAM-DTS* | 94.7 ± 0.5 | 82.1 ± 1.0 | 76.4 ± 0.6 | 79.5 ± 1.2 | 83.2 ± 0.1 |

Table 5.6: Ablation study for the DIVCAM mask batching on the PACS dataset using training-domain validation (top) and oracle validation denoted with * (bottom). We use a ResNet-18 backbone, schedules and distributions from Section 5.4.1, 25 hyperparameter samples, and 3 split seeds for standard deviations.

表5.6:在PACS数据集上使用训练域验证(上方)和用*标记的oracle验证(下方)进行DIVCAM掩码批处理的消融研究。我们使用ResNet-18主干网络,调度和分布来自第5.4.1节,25个超参数样本,以及3个分割随机种子计算标准差。

5.4.5 ProDrop: Intra-Loss

5.4.5 ProDrop:内部损失

Table 5.9 shows the ablation results for different intra factor strengths \( \lambda_{6} \) with and without self-challenging. As expected, if the intra factor grows too large, performance degrades. For smaller values of \( \lambda_{6} \) without self-challenging, we observe a consistent performance increase of varying degree, comparable to the gains achieved by self-challenging.

表5.9展示了不同内部因子强度λ6在有无自我挑战情况下的消融结果。如预期,若内部因子过大,性能会下降。对于较小的λ6值且无自我挑战时,我们观察到类似自我挑战所带来的不同程度的持续性能提升。

Even though we experimented with different weightings of the individual distance metrics as well as of the overall loss, we observe no consistent performance improvements on top of self-challenging across testing environments and data splits. The observations from Section 5.4.5, Figure 4.6, and Appendix B suggest that self-challenging inherently already enforces the desired properties to a near-optimal extent, such that the additional loss term does not provide any consistent further benefit.

尽管我们尝试了对各个距离度量以及整体损失的不同加权,但在测试环境和数据划分中,未观察到在自我挑战基础上有一致的性能提升。第5.4.5节、图4.6及附录B的观察表明,自我挑战本质上已近乎最优地强制实现了期望属性,因此额外的损失项并未带来持续的额外收益。

| Name | P | A | C | S | Avg. |
| --- | --- | --- | --- | --- | --- |
| DivCAM | 97.6 ± 0.4 | 85.2 ± 0.8 | 80.5 ± 0.7 | 78.3 ± 0.8 | 85.4 ± 0.5 |
| DivCAM-S | 97.3 ± 0.4 | 86.2 ± 1.4 | 79.1 ± 2.2 | 79.2 ± 0.1 | 85.4 ± 0.2 |
| DivCAM-S + TAP | 96.9 ± 0.1 | 85.1 ± 1.5 | 78.7 ± 0.4 | 75.3 ± 0.6 | 84.0 ± 0.4 |
| DivCAM-S + HNC | 97.2 ± 0.3 | 87.2 ± 0.9 | 79.2 ± 0.6 | 71.7 ± 3.1 | 83.8 ± 0.4 |
| DivCAM-S + CDANN | 97.5 ± 0.4 | 85.2 ± 2.8 | 78.3 ± 2.0 | 74.8 ± 0.9 | 84.0 ± 1.5 |
| DivCAM-S + MMD | 97.0 ± 0.2 | 85.4 ± 1.0 | 81.5 ± 0.4 | 75.8 ± 3.5 | 84.9 ± 1.1 |
| CAM + CDANN | 97.2 ± 0.3 | 86.7 ± 0.5 | 77.3 ± 1.7 | 71.5 ± 1.3 | 83.2 ± 0.8 |
| DivCAM* | 96.2 ± 1.2 | 87.0 ± 0.5 | 82.0 ± 0.9 | 80.8 ± 0.6 | 86.5 ± 0.1 |
| DivCAM-S* | 97.2 ± 0.3 | 86.5 ± 0.4 | 83.0 ± 0.5 | 82.2 ± 0.1 | 87.2 ± 0.1 |
| DivCAM-S + TAP* | 97.3 ± 0.3 | 87.2 ± 0.8 | **83.2 ± 0.8** | 82.8 ± 0.2 | 87.6 ± 0.0 |
| DivCAM-S + HNC* | 97.3 ± 0.2 | 87.4 ± 0.5 | 81.4 ± 0.6 | 79.7 ± 1.1 | 86.5 ± 0.4 |
| DivCAM-S + CDANN* | 97.3 ± 0.5 | 85.9 ± 1.2 | 80.6 ± 0.4 | 80.9 ± 0.4 | 86.2 ± 0.2 |
| DivCAM-S + MMD* | 97.3 ± 0.5 | 86.8 ± 0.7 | 83.2 ± 0.4 | 80.9 ± 0.7 | 87.1 ± 0.4 |
| CAM + CDANN* | 97.2 ± 0.4 | 86.7 ± 0.5 | 81.9 ± 0.2 | 80.6 ± 0.7 | 86.6 ± 0.2 |
| 名称 | P | A | C | S | 平均 |
| --- | --- | --- | --- | --- | --- |
| DivCAM | 97.6 ± 0.4 | 85.2 ± 0.8 | 80.5 ± 0.7 | 78.3 ± 0.8 | 85.4 ± 0.5 |
| DivCAM-S | 97.3 ± 0.4 | 86.2 ± 1.4 | 79.1 ± 2.2 | 79.2 ± 0.1 | 85.4 ± 0.2 |
| DivCAM-S + TAP | 96.9 ± 0.1 | 85.1 ± 1.5 | 78.7 ± 0.4 | 75.3 ± 0.6 | 84.0 ± 0.4 |
| DivCAM-S + HNC | 97.2 ± 0.3 | 87.2 ± 0.9 | 79.2 ± 0.6 | 71.7 ± 3.1 | 83.8 ± 0.4 |
| DivCAM-S + CDANN | 97.5 ± 0.4 | 85.2 ± 2.8 | 78.3 ± 2.0 | 74.8 ± 0.9 | 84.0 ± 1.5 |
| DivCAM-S + MMD | 97.0 ± 0.2 | 85.4 ± 1.0 | 81.5 ± 0.4 | 75.8 ± 3.5 | 84.9 ± 1.1 |
| CAM + CDANN | 97.2 ± 0.3 | 86.7 ± 0.5 | 77.3 ± 1.7 | 71.5 ± 1.3 | 83.2 ± 0.8 |
| DivCAM* | 96.2 ± 1.2 | 87.0 ± 0.5 | 82.0 ± 0.9 | 80.8 ± 0.6 | 86.5 ± 0.1 |
| DivCAM-S* | 97.2 ± 0.3 | 86.5 ± 0.4 | 83.0 ± 0.5 | 82.2 ± 0.1 | 87.2 ± 0.1 |
| DivCAM-S + TAP* | 97.3 ± 0.3 | 87.2 ± 0.8 | **83.2 ± 0.8** | 82.8 ± 0.2 | 87.6 ± 0.0 |
| DivCAM-S + HNC* | 97.3 ± 0.2 | 87.4 ± 0.5 | 81.4 ± 0.6 | 79.7 ± 1.1 | 86.5 ± 0.4 |
| DivCAM-S + CDANN* | 97.3 ± 0.5 | 85.9 ± 1.2 | 80.6 ± 0.4 | 80.9 ± 0.4 | 86.2 ± 0.2 |
| DivCAM-S + MMD* | 97.3 ± 0.5 | 86.8 ± 0.7 | 83.2 ± 0.4 | 80.9 ± 0.7 | 87.1 ± 0.4 |
| CAM + CDANN* | 97.2 ± 0.4 | 86.7 ± 0.5 | 81.9 ± 0.2 | 80.6 ± 0.7 | 86.6 ± 0.2 |

Table 5.7: Ablation study for the DivCAM masks on the PACS dataset using training-domain validation (top) and oracle validation denoted with * (bottom). We use a ResNet-50 backbone, schedules and distributions from Section 5.4.1, 20 hyperparameter samples, and 3 split seeds for standard deviations. The results can be directly integrated into Table 5.2, as we use the same tuning protocol provided in DOMAINBED.

表5.7:在PACS数据集上使用训练域验证(上方)和带*标记的oracle验证(下方)对DIVCAM掩码的消融研究。我们采用ResNet-50骨干网络,调度和分布来自第5.4.1节,20个超参数样本,以及3个划分随机种子计算标准差。结果可直接整合入表5.2,因为我们使用了DomainBED中提供的相同调优协议。

| Weight | SC | P | A | C | S | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| 0.0 | ✗ | 93.2 ± 0.0 | 80.4 ± 1.0 | 73.7 ± 0.4 | 72.6 ± 2.6 | 80.0 ± 0.7 |
| -0.1 | ✗ | 94.3 ± 0.3 | 78.8 ± 0.3 | 74.4 ± 0.7 | 75.3 ± 1.6 | 80.7 ± 0.4 |
| -0.2 | ✗ | 93.2 ± 0.5 | 76.8 ± 1.5 | 72.9 ± 0.1 | 71.8 ± 1.0 | 78.7 ± 0.5 |
| -0.3 | ✗ | 94.0 ± 0.4 | 79.8 ± 1.0 | 75.6 ± 1.5 | 73.9 ± 1.0 | 80.8 ± 0.1 |
| -0.4 | ✗ | 93.6 ± 0.1 | 79.8 ± 0.5 | 74.3 ± 1.3 | 75.7 ± 2.3 | 80.8 ± 0.5 |
| -0.5 | ✗ | 93.0 ± 0.9 | 79.4 ± 1.6 | 73.2 ± 0.9 | 75.5 ± 1.0 | 80.3 ± 0.4 |
| -1.0 | ✗ | 94.2 ± 0.4 | 80.2 ± 1.1 | 72.7 ± 1.4 | 68.6 ± 0.6 | 78.9 ± 0.2 |
| -2.0 | ✗ | 94.7 ± 0.4 | 78.7 ± 0.5 | 75.5 ± 1.0 | 71.1 ± 2.2 | 80.0 ± 0.7 |
| \( 0.0 \rightarrow -1.0 \) | ✗ | 94.3 ± 0.5 | 79.5 ± 0.6 | 74.2 ± 0.3 | 72.2 ± 2.1 | 80.0 ± 0.4 |
| 0.0 | ✓ | 93.4 ± 0.6 | 80.5 ± 0.8 | 75.6 ± 0.1 | 74.3 ± 2.0 | 81.0 ± 0.4 |
| -0.1 | ✓ | 93.7 ± 0.3 | 83.2 ± 1.4 | 75.9 ± 1.2 | 71.1 ± 1.9 | 81.0 ± 0.8 |
| -0.2 | ✓ | 93.2 ± 0.2 | 81.0 ± 0.7 | 73.9 ± 0.7 | 75.0 ± 0.6 | 80.8 ± 0.1 |
| -0.3 | ✓ | 93.4 ± 0.8 | 81.4 ± 0.4 | 71.3 ± 1.2 | 76.9 ± 0.9 | 80.7 ± 0.2 |
| -0.4 | ✓ | 94.0 ± 0.3 | 81.7 ± 0.9 | 72.9 ± 0.4 | 73.5 ± 1.0 | 80.5 ± 0.2 |
| -0.5 | ✓ | 93.5 ± 0.7 | 80.7 ± 1.6 | 71.6 ± 1.4 | 73.6 ± 1.5 | 79.8 ± 0.9 |
| -1.0 | ✓ | 94.6 ± 0.2 | 81.6 ± 1.2 | 72.9 ± 0.4 | 77.0 ± 1.6 | 81.5 ± 0.2 |
| -2.0 | ✓ | 94.0 ± 0.5 | 79.5 ± 1.2 | 76.4 ± 0.4 | 73.9 ± 1.7 | 80.9 ± 0.6 |
| \( 0.0 \rightarrow -1.0 \) | ✓ | 94.1 ± 0.4 | 79.2 ± 0.6 | 74.1 ± 1.1 | 71.9 ± 0.2 | 79.8 ± 0.3 |
| 权重 | SC | P | A | C | S | 平均 |
| --- | --- | --- | --- | --- | --- | --- |
| 0.0 | ✗ | 93.2 ± 0.0 | 80.4 ± 1.0 | 73.7 ± 0.4 | 72.6 ± 2.6 | 80.0 ± 0.7 |
| -0.1 | ✗ | 94.3 ± 0.3 | 78.8 ± 0.3 | 74.4 ± 0.7 | 75.3 ± 1.6 | 80.7 ± 0.4 |
| -0.2 | ✗ | 93.2 ± 0.5 | 76.8 ± 1.5 | 72.9 ± 0.1 | 71.8 ± 1.0 | 78.7 ± 0.5 |
| -0.3 | ✗ | 94.0 ± 0.4 | 79.8 ± 1.0 | 75.6 ± 1.5 | 73.9 ± 1.0 | 80.8 ± 0.1 |
| -0.4 | ✗ | 93.6 ± 0.1 | 79.8 ± 0.5 | 74.3 ± 1.3 | 75.7 ± 2.3 | 80.8 ± 0.5 |
| -0.5 | ✗ | 93.0 ± 0.9 | 79.4 ± 1.6 | 73.2 ± 0.9 | 75.5 ± 1.0 | 80.3 ± 0.4 |
| -1.0 | ✗ | 94.2 ± 0.4 | 80.2 ± 1.1 | 72.7 ± 1.4 | 68.6 ± 0.6 | 78.9 ± 0.2 |
| -2.0 | ✗ | 94.7 ± 0.4 | 78.7 ± 0.5 | 75.5 ± 1.0 | 71.1 ± 2.2 | 80.0 ± 0.7 |
| \( 0.0 \rightarrow -1.0 \) | ✗ | 94.3 ± 0.5 | 79.5 ± 0.6 | 74.2 ± 0.3 | 72.2 ± 2.1 | 80.0 ± 0.4 |
| 0.0 | ✓ | 93.4 ± 0.6 | 80.5 ± 0.8 | 75.6 ± 0.1 | 74.3 ± 2.0 | 81.0 ± 0.4 |
| -0.1 | ✓ | 93.7 ± 0.3 | 83.2 ± 1.4 | 75.9 ± 1.2 | 71.1 ± 1.9 | 81.0 ± 0.8 |
| -0.2 | ✓ | 93.2 ± 0.2 | 81.0 ± 0.7 | 73.9 ± 0.7 | 75.0 ± 0.6 | 80.8 ± 0.1 |
| -0.3 | ✓ | 93.4 ± 0.8 | 81.4 ± 0.4 | 71.3 ± 1.2 | 76.9 ± 0.9 | 80.7 ± 0.2 |
| -0.4 | ✓ | 94.0 ± 0.3 | 81.7 ± 0.9 | 72.9 ± 0.4 | 73.5 ± 1.0 | 80.5 ± 0.2 |
| -0.5 | ✓ | 93.5 ± 0.7 | 80.7 ± 1.6 | 71.6 ± 1.4 | 73.6 ± 1.5 | 79.8 ± 0.9 |
| -1.0 | ✓ | 94.6 ± 0.2 | 81.6 ± 1.2 | 72.9 ± 0.4 | 77.0 ± 1.6 | 81.5 ± 0.2 |
| -2.0 | ✓ | 94.0 ± 0.5 | 79.5 ± 1.2 | 76.4 ± 0.4 | 73.9 ± 1.7 | 80.9 ± 0.6 |
| \( 0.0 \rightarrow -1.0 \) | ✓ | 94.1 ± 0.4 | 79.2 ± 0.6 | 74.1 ± 1.1 | 71.9 ± 0.2 | 79.8 ± 0.3 |

Table 5.8: Performance comparison for different negative class weights on the PACS dataset without (top) and with self-challenging (bottom), using training-domain validation and a ResNet-18 backbone. The feature and batch drop factors are kept constant at \( p = 0.5 \) and \( b = 1/3 \); a linear schedule from \( a \) to \( b \) throughout training is denoted \( a \rightarrow b \).

表5.8:在PACS数据集上,不使用(上方)和使用自我挑战(下方)时,不同负类权重的性能比较,采用训练域验证和ResNet-18骨干网络。特征和批次丢弃因子保持恒定,分别为\( p = 0.5 \)和\( b = 1/3 \);训练过程中从\( a \)到\( b \)的线性调度记为\( a \rightarrow b \)。

| Intra factor \( \lambda_{6} \) | SC | P | A | C | S | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| 0.0 | ✗ | 94.2 ± 0.4 | 80.2 ± 1.1 | 72.7 ± 1.4 | 68.6 ± 0.6 | 78.9 ± 0.2 |
| -0.1 | ✗ | 94.1 ± 0.4 | 81.2 ± 1.2 | 73.6 ± 0.8 | 75.2 ± 2.4 | 81.0 ± 0.6 |
| -0.2 | ✗ | 94.5 ± 0.4 | 81.5 ± 1.1 | 74.4 ± 1.6 | 74.4 ± 1.1 | 81.2 ± 0.5 |
| -0.5 | ✗ | 92.9 ± 0.8 | 82.4 ± 0.4 | 73.5 ± 1.6 | 73.4 ± 2.0 | 80.6 ± 0.7 |
| -1.0 | ✗ | 94.7 ± 0.4 | 80.4 ± 0.6 | 73.9 ± 0.9 | 75.2 ± 1.8 | 81.1 ± 0.7 |
| 0.0 | ✓ | 94.6 ± 0.2 | 81.6 ± 1.2 | 72.9 ± 0.4 | 77.0 ± 1.6 | 81.5 ± 0.2 |
| -0.1 | ✓ | 94.6 ± 0.3 | 82.6 ± 0.9 | 72.2 ± 0.5 | 75.4 ± 0.1 | 81.2 ± 0.3 |
| -0.2 | ✓ | 94.1 ± 0.1 | 81.9 ± 0.2 | 73.3 ± 0.7 | 75.8 ± 0.9 | 81.2 ± 0.1 |
| -0.3 | ✓ | 94.8 ± 0.2 | 81.5 ± 0.1 | 75.1 ± 0.0 | 74.5 ± 0.4 | 81.5 ± 0.2 |
| -0.4 | ✓ | 93.9 ± 0.5 | 82.4 ± 0.2 | 74.3 ± 1.3 | 76.1 ± 1.1 | 81.7 ± 0.2 |
| -0.5 | ✓ | 93.6 ± 0.6 | 82.1 ± 0.9 | 76.4 ± 0.9 | 76.3 ± 0.6 | 82.1 ± 0.6 |
| -0.6 | ✓ | 93.8 ± 0.6 | 82.1 ± 0.3 | 75.7 ± 0.2 | 73.1 ± 3.1 | 81.2 ± 1.0 |
| -0.7 | ✓ | 94.0 ± 0.4 | 82.9 ± 1.2 | 74.1 ± 0.5 | 76.3 ± 1.0 | 81.8 ± 0.4 |
| -0.8 | ✓ | 94.1 ± 0.5 | 81.8 ± 0.3 | 75.5 ± 0.2 | 72.7 ± 0.3 | 81.0 ± 0.1 |
| -1.0 | ✓ | 95.1 ± 0.6 | 80.7 ± 1.5 | 74.5 ± 1.1 | 73.7 ± 1.2 | 81.0 ± 0.4 |
| -2.0 | ✓ | 94.3 ± 0.3 | 82.9 ± 0.5 | 75.1 ± 1.0 | 75.0 ± 1.1 | 81.8 ± 0.5 |
| -3.0 | ✓ | 94.5 ± 0.3 | 79.8 ± 1.0 | 74.9 ± 1.0 | 71.2 ± 0.4 | 80.1 ± 0.3 |
| -10.0 | ✓ | 90.8 ± 3.7 | 80.2 ± 0.3 | 72.9 ± 0.1 | 75.0 ± 1.8 | 79.7 ± 0.6 |
| -100.0 | ✓ | 72.0 ± 11 | 76.3 ± 2.2 | 73.0 ± 1.1 | 69.1 ± 2.6 | 72.6 ± 4.0 |
| 内部因子 \( \lambda_{6} \) | SC | P | A | C | S | 平均 |
| --- | --- | --- | --- | --- | --- | --- |
| 0.0 | ✗ | 94.2 ± 0.4 | 80.2 ± 1.1 | 72.7 ± 1.4 | 68.6 ± 0.6 | 78.9 ± 0.2 |
| -0.1 | ✗ | 94.1 ± 0.4 | 81.2 ± 1.2 | 73.6 ± 0.8 | 75.2 ± 2.4 | 81.0 ± 0.6 |
| -0.2 | ✗ | 94.5 ± 0.4 | 81.5 ± 1.1 | 74.4 ± 1.6 | 74.4 ± 1.1 | 81.2 ± 0.5 |
| -0.5 | ✗ | 92.9 ± 0.8 | 82.4 ± 0.4 | 73.5 ± 1.6 | 73.4 ± 2.0 | 80.6 ± 0.7 |
| -1.0 | ✗ | 94.7 ± 0.4 | 80.4 ± 0.6 | 73.9 ± 0.9 | 75.2 ± 1.8 | 81.1 ± 0.7 |
| 0.0 | ✓ | 94.6 ± 0.2 | 81.6 ± 1.2 | 72.9 ± 0.4 | 77.0 ± 1.6 | 81.5 ± 0.2 |
| -0.1 | ✓ | 94.6 ± 0.3 | 82.6 ± 0.9 | 72.2 ± 0.5 | 75.4 ± 0.1 | 81.2 ± 0.3 |
| -0.2 | ✓ | 94.1 ± 0.1 | 81.9 ± 0.2 | 73.3 ± 0.7 | 75.8 ± 0.9 | 81.2 ± 0.1 |
| -0.3 | ✓ | 94.8 ± 0.2 | 81.5 ± 0.1 | 75.1 ± 0.0 | 74.5 ± 0.4 | 81.5 ± 0.2 |
| -0.4 | ✓ | 93.9 ± 0.5 | 82.4 ± 0.2 | 74.3 ± 1.3 | 76.1 ± 1.1 | 81.7 ± 0.2 |
| -0.5 | ✓ | 93.6 ± 0.6 | 82.1 ± 0.9 | 76.4 ± 0.9 | 76.3 ± 0.6 | 82.1 ± 0.6 |
| -0.6 | ✓ | 93.8 ± 0.6 | 82.1 ± 0.3 | 75.7 ± 0.2 | 73.1 ± 3.1 | 81.2 ± 1.0 |
| -0.7 | ✓ | 94.0 ± 0.4 | 82.9 ± 1.2 | 74.1 ± 0.5 | 76.3 ± 1.0 | 81.8 ± 0.4 |
| -0.8 | ✓ | 94.1 ± 0.5 | 81.8 ± 0.3 | 75.5 ± 0.2 | 72.7 ± 0.3 | 81.0 ± 0.1 |
| -1.0 | ✓ | 95.1 ± 0.6 | 80.7 ± 1.5 | 74.5 ± 1.1 | 73.7 ± 1.2 | 81.0 ± 0.4 |
| -2.0 | ✓ | 94.3 ± 0.3 | 82.9 ± 0.5 | 75.1 ± 1.0 | 75.0 ± 1.1 | 81.8 ± 0.5 |
| -3.0 | ✓ | 94.5 ± 0.3 | 79.8 ± 1.0 | 74.9 ± 1.0 | 71.2 ± 0.4 | 80.1 ± 0.3 |
| -10.0 | ✓ | 90.8 ± 3.7 | 80.2 ± 0.3 | 72.9 ± 0.1 | 75.0 ± 1.8 | 79.7 ± 0.6 |
| -100.0 | ✓ | 72.0 ± 11 | 76.3 ± 2.2 | 73.0 ± 1.1 | 69.1 ± 2.6 | 72.6 ± 4.0 |

Table 5.9: Performance comparison for different intra factors on the PACS dataset without (top) and with self-challenging (bottom), using training-domain validation and a ResNet-18 backbone. The feature and batch drop factors are kept constant at p=0.5 and b=13; the negative weight is -1.0. Distance metrics are weighted with λ2=1 and λϱ=1.

表5.9:在PACS数据集上,不使用(上方)和使用自我挑战(下方)情况下,不同内部因素的性能比较,采用训练域验证和ResNet-18骨干网络。特征和批量丢弃因素保持恒定,分别为 p=0.5 和 b=13,负权重为 -1.0。距离度量加权系数为 λ2=1 和 λϱ=1。

Conclusion and Outlook

结论与展望

In this work, we investigated whether explainability methods can be deployed during the training procedure to gain both better performance on the domain generalization task and a framework that offers users more explainability. In particular, we developed a regularization technique based on class activation maps, which visualize the parts of an image responsible for certain predictions (DivCAM), as well as prototypical representations that serve as a set of class or attribute centroids which the network uses to make its predictions (ProDrop and D-TRANSFORMERS).

在本工作中,我们探讨了是否可以在训练过程中部署可解释性方法,从而在域泛化任务上获得更好的性能,同时为用户提供一个更具可解释性的框架。具体而言,我们开发了一种基于类激活图(class activation maps,DivCAM)的正则化技术,该技术可视化图像中对特定预测负责的部分,以及基于原型表示的技术,这些原型作为若干类别或属性的中心点,网络利用它们进行预测(ProDROP和D-TRANSFORMERS)。
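The class activation maps that DivCAM builds on can be sketched in a few lines. The following is a minimal, illustrative Grad-CAM-style computation (not the thesis implementation): channel weights come from global-average-pooling the class-score gradients, and the weighted activation sum is rectified and normalized.

```python
import numpy as np

def grad_cam(feature_maps: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Gradient-based class activation map for one image.

    feature_maps: (C, H, W) activations of the last convolutional layer.
    gradients:    (C, H, W) gradients of the class score w.r.t. those activations.
    Returns an (H, W) map normalized to [0, 1].
    """
    # Channel weights: global-average-pool the gradients.
    weights = gradients.mean(axis=(1, 2))              # shape (C,)
    # Weighted sum over channels gives the raw localization map.
    cam = np.tensordot(weights, feature_maps, axes=1)  # shape (H, W)
    cam = np.maximum(cam, 0.0)                         # keep positive evidence only
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam

# Toy example: the gradient singles out channel 0, so the map follows its pattern.
feats = np.stack([np.eye(4), np.ones((4, 4))])         # (2, 4, 4)
grads = np.stack([np.ones((4, 4)), np.zeros((4, 4))])  # (2, 4, 4)
cam = grad_cam(feats, grads)
print(cam.shape)  # (4, 4)
```

In a real network, `feature_maps` and `gradients` would be captured with forward/backward hooks on the last convolutional block; the map above is what a regularizer like DivCAM can then penalize or diversify during training.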

From the results and ablations presented in this work, we have shown that DivCAM in particular is a reliable method for achieving domain generalization with small to medium-sized ResNet backbones, while offering an architecture that allows additional insights into how and why the network arrives at its predictions. Depending on the backbone and dataset, ProDrop and D-TRANSFORMERS can also be powerful options for this cause. The explainability these methods offer is a property that is highly desirable in practice, especially for safety-critical scenarios such as self-driving cars, applications in the medical field, e.g. cancer or tumor prediction, or any other automated robot that needs to operate in a diverse set of environments. We hope that the presented methods find application in such scenarios and establish additional trust and confidence that the machine learning systems work reliably.

通过本工作中展示的结果和消融实验,我们证明了DivCAM尤其是一种可靠的方法,能够实现小到中等规模ResNet骨干网络的域泛化,同时提供一种架构,允许深入了解网络如何以及为何做出预测。根据骨干网络和数据集的不同,ProDrop和D-TRANSFORMERS也可能是实现该目标的有力选择。这些方法的可解释性特性在实际应用中极为重要,尤其是在安全关键场景,如自动驾驶汽车、医疗领域(例如癌症或肿瘤预测)或任何需要在多样环境中运行的自动化机器人。我们希望所提出的方法能在此类场景中得到应用,增强对机器学习系统的信任和可靠性。

Building upon the methods presented in this work, there are a number of ablations and extension points that might be interesting to investigate. First, even though the results for ProDrop looked very promising on ResNet-18, it failed to generalize well enough to ResNet-50 to properly compete with the other methods in DOMAINBED. Looking into the details of this transition might yield a better understanding of what causes this problem, and a solution can probably be found. Secondly, even though the explainability methods we build upon have deep roots in the explainability literature, it would be interesting to either jointly train a suitable decoder for the prototypes or to visualize the closest latent patch across the whole dataset. Since we train with images coming from different domains, interesting visualizations could be possible, potentially also in a video format that shows the change throughout training; in this work, we focused on achieving good performance with these methods rather than on the explainability itself. Thirdly, many prototypical networks upscale the feature map, for example from 7×7 to 14×14, and report great performance gains from the increased spatial resolution of the latent representation [43]. We deliberately refrain from changing the backbone in such a way to preserve comparability in the DOMAINBED framework without needing to re-compute the extensive set of baselines, although both ProDrop and D-TRANSFORMERS might benefit more heavily from such a change than other methods. In a similar fashion, many prototypical networks use the Euclidean distance as a distance measure, while some works report better performance for the dot product or cosine similarity [201]. We experimented with a few options across the different methods, but a clear ablation study of this detail for the domain generalization task would be very helpful. In particular, one could also deploy any other Bregman divergence and take a metric learning approach with, for example, the Mahalanobis distance [16], which might achieve additional performance gains.

基于本工作中提出的方法,还有许多消融或扩展点值得进一步研究。首先,尽管ProDROP在ResNet-18上的结果非常有前景,但它未能在ResNet-50上实现足够的泛化,无法在DOMAINBED中与其他方法竞争。深入分析这一转变的细节,可能有助于更好地理解问题根源并找到解决方案。其次,尽管我们所依赖的可解释性方法在相关文献中有深厚基础,但联合训练适合的解码器以解码原型,或可视化整个数据集中最接近的潜在图像块,可能会带来有趣的视觉效果。由于训练图像来自不同域,可能实现有趣的可视化,甚至以视频形式展示训练过程中的变化。本工作重点在于利用这些方法实现良好性能,而非专注于可解释性本身。第三,许多原型网络会将特征图尺寸放大,例如从7×7放大到14×14,并报告由于潜在表示空间分辨率提升带来的显著性能提升[43]。我们有意避免修改骨干网络,以便在DOMAINBED框架中保持可比性,无需重新计算大量基线,尽管ProDrop和D-TRANSFORMERS可能更能从此类改变中受益。同样,许多原型网络使用欧氏距离作为度量,而部分研究报告点积或余弦相似度表现更佳[201]。我们在不同方法中尝试了几种选项,但针对域泛化任务的详细消融研究将非常有价值。特别地,也可以考虑部署其他Bregman散度度量,并采用度量学习方法,例如马氏距离(Mahalanobis distance)[16],以期获得额外性能提升。
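The metric choice discussed above (Euclidean vs. dot product vs. cosine) is a one-line difference in a prototype classifier. The following sketch is illustrative only, with names of our own choosing, and shows exactly where such an ablation would plug in:

```python
import numpy as np

def classify(query: np.ndarray, prototypes: np.ndarray, metric: str = "euclidean") -> int:
    """Assign a query embedding to its nearest class prototype.

    prototypes: (K, D) matrix, one centroid per class; query: (D,) embedding.
    Larger score = better match, so Euclidean distances are negated.
    """
    if metric == "euclidean":
        scores = -np.linalg.norm(prototypes - query, axis=1)
    elif metric == "cosine":
        q = query / np.linalg.norm(query)
        p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
        scores = p @ q
    elif metric == "dot":
        scores = prototypes @ query
    else:
        raise ValueError(f"unknown metric: {metric}")
    return int(np.argmax(scores))

# Two 2-D class prototypes; a query close to class 0.
protos = np.array([[1.0, 0.0], [0.0, 1.0]])
print(classify(np.array([0.9, 0.2]), protos, "euclidean"))  # 0
print(classify(np.array([0.9, 0.2]), protos, "cosine"))     # 0
```

A Mahalanobis variant would replace the norm with \( \sqrt{(x-\mu)^{\top}\Sigma^{-1}(x-\mu)} \) using a (learned or estimated) covariance \( \Sigma \), which is the metric-learning direction suggested above.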

Finally, especially for D-TRANSFORMERS, our method of aggregating across multiple environments is very simple in nature, and finding a more elaborate way to utilize the multiple aligned prototypes of each environment could be a great opportunity for follow-up work. Nevertheless, we hope that our analysis can serve as a first stepping stone and as groundwork for developing more domain generalization algorithms based on explainability methods.

最后,尤其对于D-TRANSFORMERS,我们在多个环境间聚合的方法非常简单,寻找更复杂的方式利用每个环境的多个对齐原型,可能是后续工作的良好机会。尽管如此,我们希望我们的分析能作为基于可解释性方法开发更多域泛化算法的第一步和基础。
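To make concrete what "very simple" aggregation means, the sketch below assumes the baseline is a plain class-wise mean over the per-environment prototype sets (an assumption on our part; the function name is illustrative). Any follow-up work would replace this averaging with something richer, e.g. attention-weighted pooling over environments.

```python
import numpy as np

def aggregate_prototypes(env_protos: np.ndarray) -> np.ndarray:
    """Collapse per-environment prototypes into a single set by averaging.

    env_protos: (E, K, D) array — one (K, D) prototype matrix per training
    environment, assumed already aligned class-wise (row k = class k everywhere).
    Returns a (K, D) merged prototype matrix.
    """
    return env_protos.mean(axis=0)

# Three environments, two classes, 2-D embeddings.
envs = np.array([
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.8, 0.2], [0.2, 0.8]],
    [[1.2, -0.2], [-0.2, 1.2]],
])
merged = aggregate_prototypes(envs)
print(merged[0])  # [1. 0.] — per-environment class-0 centroids average out
```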

Bibliography

参考文献

[1] A. Agarwal, A. Beygelzimer, M. Dudík, J. Langford, and H. M. Wallach. "A Reductions Approach to Fair Classification". In: International Conference on Machine Learning, ICML. 2018.

[1] A. Agarwal, A. Beygelzimer, M. Dudík, J. Langford, 和 H. M. Wallach. “公平分类的归约方法”。发表于国际机器学习大会(ICML),2018年。

[2] K. Akuzawa, Y. Iwasawa, and Y. Matsuo. "Adversarial Invariant Feature Learning with Accuracy Constraint for Domain Generalization". In: European Conference on Machine Learning and Knowledge Discovery in Databases, ECML PKDD. 2019.

[2] K. Akuzawa, Y. Iwasawa, 和 Y. Matsuo. “带准确性约束的对抗不变特征学习用于域泛化”。发表于欧洲机器学习与知识发现数据库会议(ECML PKDD),2019年。

[3] G. Alain and Y. Bengio. "Understanding intermediate layers using linear classifier probes". In: International Conference on Learning Representations, ICLR. 2017.

[3] G. Alain 和 Y. Bengio. “使用线性分类器探针理解中间层”。发表于:国际学习表征会议(ICLR),2017年。

[4] E. A. AlBadawy, A. Saha, and M. A. Mazurowski. "Deep learning for segmentation of brain tumors: Impact of cross-institutional training and testing". In: Medical Physics 45.3 (2018), pp. 1150-1158.

[4] E. A. AlBadawy, A. Saha 和 M. A. Mazurowski. “脑肿瘤分割的深度学习:跨机构训练和测试的影响”。发表于:医学物理学,45卷第3期(2018),第1150-1158页。

[5] I. Albuquerque, J. Monteiro, M. Darvishi, T. H. Falk, and I. Mitliagkas. Generalizing to unseen domains via distribution matching. 2019. arXiv: 1911.00804 [cs.LG].

[5] I. Albuquerque, J. Monteiro, M. Darvishi, T. H. Falk 和 I. Mitliagkas. 通过分布匹配实现对未见领域的泛化。2019年。arXiv: 1911.00804 [cs.LG]。

[6] D. Alvarez-Melis and T. S. Jaakkola. "Towards Robust Interpretability with Self-Explaining Neural Networks". In: Advances in Neural Information Processing Systems, NeurIPS. 2018.

[6] D. Alvarez-Melis 和 T. S. Jaakkola. “迈向具有自解释能力的神经网络的稳健可解释性”。发表于:神经信息处理系统进展(NeurIPS),2018年。

[7] S. Amershi, M. Chickering, S. Drucker, B. Lee, P. Simard, and J. Suh. "ModelTracker: Redesigning Performance Analysis Tools for Machine Learning". In: Conference on Human Factors in Computing Systems, CHI. 2015.

[7] S. Amershi, M. Chickering, S. Drucker, B. Lee, P. Simard 和 J. Suh. “ModelTracker:重新设计机器学习性能分析工具”。发表于:人因计算系统会议(CHI),2015年。

[8] S. Anwar and N. Barnes. "Real Image Denoising With Feature Attention". In: International Conference on Computer Vision, ICCV. 2019.

[8] S. Anwar 和 N. Barnes. “基于特征注意力的真实图像去噪”。发表于:国际计算机视觉会议(ICCV),2019年。

[9] M. Arjovsky, L. Bottou, I. Gulrajani, and D. Lopez-Paz. Invariant Risk Minimization. 2019. arXiv: 1907.02893 [stat.ML].

[9] M. Arjovsky, L. Bottou, I. Gulrajani 和 D. Lopez-Paz. 不变风险最小化(Invariant Risk Minimization)。2019年。arXiv: 1907.02893 [stat.ML]。

[10] N. Asadi, A. M. Sarfi, M. Hosseinzadeh, Z. Karimpour, and M. Eftekhari. Towards Shape Biased Unsupervised Representation Learning for Domain Generalization. 2019. arXiv: 1909.08245 [cs.CV].

[10] N. Asadi, A. M. Sarfi, M. Hosseinzadeh, Z. Karimpour 和 M. Eftekhari. 面向领域泛化的形状偏置无监督表示学习。2019年。arXiv: 1909.08245 [cs.CV]。

[11] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek. "On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation". In: PLOS ONE 10.7 (2015), e0130140.

[11] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller 和 W. Samek. “基于层次相关传播的非线性分类器决策的像素级解释”。发表于:PLOS ONE,10卷第7期(2015),e0130140。

[12] W. Bae, J. Noh, and G. Kim. "Rethinking Class Activation Mapping for Weakly Supervised Object Localization". In: European Conference on Computer Vision, ECCV. 2020.

[12] W. Bae, J. Noh 和 G. Kim. “重新思考弱监督目标定位的类激活映射”。发表于:欧洲计算机视觉会议(ECCV),2020年。

[13] D. Bahdanau, K. Cho, and Y. Bengio. "Neural Machine Translation by Jointly Learning to Align and Translate". In: International Conference on Learning Representations, ICLR. 2015.

[13] D. Bahdanau, K. Cho 和 Y. Bengio. “通过联合学习对齐与翻译的神经机器翻译”。发表于:国际学习表征会议(ICLR),2015年。

[14] M. Baktashmotlagh, M. T. Harandi, B. C. Lovell, and M. Salzmann. "Unsupervised Domain Adaptation by Domain Invariant Projection". In: International Conference on Computer Vision, ICCV. 2013.

[14] M. Baktashmotlagh, M. T. Harandi, B. C. Lovell 和 M. Salzmann. “通过领域不变投影实现无监督领域适应”。发表于:国际计算机视觉会议(ICCV),2013年。

[15] Y. Balaji, S. Sankaranarayanan, and R. Chellappa. "MetaReg: Towards Domain Generalization using Meta-Regularization". In: Advances in Neural Information Processing Systems, NeurIPS. 2018.

[15] Y. Balaji, S. Sankaranarayanan 和 R. Chellappa. “MetaReg:基于元正则化的领域泛化方法”。发表于:神经信息处理系统进展(NeurIPS),2018年。

[16] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. "Clustering with Bregman Divergences". In: International Conference on Data Mining, SDM. 2004.

[16] A. Banerjee, S. Merugu, I. S. Dhillon 和 J. Ghosh. “基于Bregman散度的聚类”。发表于:国际数据挖掘会议(SDM),2004年。

[17] S. Beery, G. V. Horn, and P. Perona. "Recognition in Terra Incognita". In: European Conference on Computer Vision, ECCV. 2018.

[17] S. Beery, G. V. Horn 和 P. Perona. “在未知领域中的识别”。发表于:欧洲计算机视觉会议(ECCV),2018年。

[18] J. Bien and R. Tibshirani. "Prototype selection for interpretable classification". In: The Annals of Applied Statistics 5.4 (2011), pp. 2403-2424.

[18] J. Bien 和 R. Tibshirani. “用于可解释分类的原型选择”。发表于:应用统计年鉴,5卷第4期(2011),第2403-2424页。

[19] G. Blanchard, A. A. Deshmukh, U. Dogan, G. Lee, and C. Scott. Domain Generalization by Marginal Transfer Learning. 2017. arXiv: 1711.07910 [stat.ML].

[19] G. Blanchard, A. A. Deshmukh, U. Dogan, G. Lee, 和 C. Scott. 通过边际迁移学习实现领域泛化。2017。arXiv: 1711.07910 [stat.ML]。

[20] G. Blanchard, G. Lee, and C. Scott. "Generalizing from Several Related Classification Tasks to a New Unlabeled Sample". In: Advances in Neural Information Processing Systems, NIPS. 2011.

[20] G. Blanchard, G. Lee, 和 C. Scott. “从多个相关分类任务泛化到新的无标签样本”。载于:神经信息处理系统进展,NIPS。2011。

[21] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. "Unsupervised Pixel-Level Domain Adaptation with Generative Adversarial Networks". In: Conference on Computer Vision and Pattern Recognition, CVPR. 2017.

[21] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, 和 D. Krishnan. “基于生成对抗网络的无监督像素级领域自适应”。载于:计算机视觉与模式识别会议,CVPR。2017。

[22] K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan. "Domain Separation Networks". In: Advances in Neural Information Processing Systems, NIPS. 2016.

[22] K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, 和 D. Erhan. “领域分离网络”。载于:神经信息处理系统进展,NIPS。2016。

[23] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. "Language Models are Few-Shot Learners". In: Advances in Neural Information Processing Systems, NeurIPS. 2020.

[23] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, 和 D. Amodei. “语言模型是少样本学习者”。载于:神经信息处理系统进展,NeurIPS。2020。

[24] X. Cai, J. Shang, Z. Jin, F. Liu, B. Qiang, W. Xie, and L. Zhao. "DBGE: Employee Turnover Prediction Based on Dynamic Bipartite Graph Embedding". In: IEEE Access 8 (2020), pp. 10390-10402.

[24] X. Cai, J. Shang, Z. Jin, F. Liu, B. Qiang, W. Xie, 和 L. Zhao. “DBGE:基于动态二分图嵌入的员工流失预测”。载于:IEEE Access 8 (2020), 页码 10390-10402。

[25] T. Calders, F. Kamiran, and M. Pechenizkiy. "Building Classifiers with Independency Constraints". In: International Conference on Data Mining, ICDM. 2009.

[25] T. Calders, F. Kamiran, 和 M. Pechenizkiy. “构建具有独立性约束的分类器”。载于:国际数据挖掘会议,ICDM。2009。

[26] O. Camburu, T. Rocktäschel, T. Lukasiewicz, and P. Blunsom. "e-SNLI: Natural Language Inference with Natural Language Explanations". In: Advances in Neural Information Processing Systems, NeurIPS. 2018.

[26] O. Camburu, T. Rocktäschel, T. Lukasiewicz, 和 P. Blunsom. “e-SNLI:带有自然语言解释的自然语言推理”。载于:神经信息处理系统进展,NeurIPS。2018。

[27] O. Camburu, B. Shillingford, P. Minervini, T. Lukasiewicz, and P. Blunsom. "Make Up Your Mind! Adversarial Generation of Inconsistent Natural Language Explanations". In: Annual Meeting of the Association for Computational Linguistics, ACL. 2020.

[27] O. Camburu, B. Shillingford, P. Minervini, T. Lukasiewicz, 和 P. Blunsom. “做出决定!不一致自然语言解释的对抗生成”。载于:计算语言学协会年会,ACL。2020。

[28] F. M. Carlucci, A. D'Innocente, S. Bucci, B. Caputo, and T. Tommasi. "Domain Generalization by Solving Jigsaw Puzzles". In: Conference on Computer Vision and Pattern Recognition, CVPR. 2019.

[28] F. M. Carlucci, A. D'Innocente, S. Bucci, B. Caputo, 和 T. Tommasi. “通过拼图游戏实现领域泛化”。载于:计算机视觉与模式识别会议,CVPR。2019。

[29] F. M. Carlucci, L. Porzi, B. Caputo, E. Ricci, and S. R. Bulò. "AutoDIAL: Automatic Domain Alignment Layers". In: International Conference on Computer Vision, ICCV. 2017.

[29] F. M. Carlucci, L. Porzi, B. Caputo, E. Ricci, 和 S. R. Bulò. “AutoDIAL:自动领域对齐层”。载于:国际计算机视觉会议,ICCV。2017。

[30] R. Caruana, S. Lawrence, and C. L. Giles. "Overfitting in Neural Nets: Backpropagation, Conjugate Gradient, and Early Stopping". In: Advances in Neural Information Processing Systems, NIPS. 2000.

[30] R. Caruana, S. Lawrence, 和 C. L. Giles. “神经网络中的过拟合:反向传播、共轭梯度和提前停止”。载于:神经信息处理系统进展,NIPS。2000。

[31] D. C. Castro, I. Walker, and B. Glocker. "Causality matters in medical imaging". In: Nature Communications 11.1 (2020).

[31] D. C. Castro, I. Walker, 和 B. Glocker. “因果关系在医学影像中的重要性”。载于:自然通讯 11.1 (2020)。

[32] C. Chen, O. Li, D. Tao, A. Barnett, C. Rudin, and J. Su. "This Looks Like That: Deep Learning for Interpretable Image Recognition". In: Advances in Neural Information Processing Systems, NeurIPS. 2019.

[32] C. Chen, O. Li, D. Tao, A. Barnett, C. Rudin, 和 J. Su. “这看起来像那个:用于可解释图像识别的深度学习”。载于:神经信息处理系统进展,NeurIPS。2019。

[33] M. J. Choi, J. J. Lim, A. Torralba, and A. S. Willsky. "Exploiting hierarchical context on a large database of object categories". In: Conference on Computer Vision and Pattern Recognition, CVPR. 2010.

[33] M. J. Choi, J. J. Lim, A. Torralba, 和 A. S. Willsky. “利用大型物体类别数据库中的层次上下文”。载于:计算机视觉与模式识别会议,CVPR。2010。

[34] G. Csurka. "A Comprehensive Survey on Domain Adaptation for Visual Applications". In: Domain Adaptation in Computer Vision Applications. 2017.

[34] G. Csurka. “视觉应用领域自适应的综合调研”。载于:《计算机视觉应用中的领域自适应》。2017年。

[35] A. D'Innocente and B. Caputo. "Domain Generalization with Domain-Specific Aggregation Modules". In: German Conference on Pattern Recognition, GCPR. 2018.

[35] A. D'Innocente 和 B. Caputo。“具有领域特定聚合模块的领域泛化”。载于:德国模式识别会议,GCPR。2018年。

[36] D. Dai and L. V. Gool. "Dark Model Adaptation: Semantic Image Segmentation from Daytime to Nighttime". In: International Conference on Intelligent Transportation Systems, ITSC. 2018.

[36] D. Dai 和 L. V. Gool。“暗模型适应:从白天到夜晚的语义图像分割”。载于:智能交通系统国际会议,ITSC。2018年。

[37] Z. Deng, F. Ding, C. Dwork, R. Hong, G. Parmigiani, P. Patil, and P. Sur. Representation via Representations: Domain Generalization via Adversarially Learned Invariant Representations. 2020. arXiv: 2006.11478 [cs.LG].

[37] Z. Deng, F. Ding, C. Dwork, R. Hong, G. Parmigiani, P. Patil 和 P. Sur。通过表示实现表示:基于对抗学习的不变表示的领域泛化。2020年。arXiv: 2006.11478 [cs.LG]。

[38] A. A. Deshmukh, Y. Lei, S. Sharma, U. Dogan, J. W. Cutler, and C. Scott. A Generalization Error Bound for Multi-class Domain Generalization. 2019. arXiv: 1905.10392 [stat.ML].

[38] A. A. Deshmukh, Y. Lei, S. Sharma, U. Dogan, J. W. Cutler 和 C. Scott。多类领域泛化的一般化误差界。2019年。arXiv: 1905.10392 [stat.ML]。

[39] J. Devlin, M. Chang, K. Lee, and K. Toutanova. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". In: Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT. 2019.

[39] J. Devlin, M. Chang, K. Lee 和 K. Toutanova。“BERT:用于语言理解的深度双向Transformer预训练”。载于:北美计算语言学协会人类语言技术会议,NAACL-HLT。2019年。

[40] T. DeVries and G. W. Taylor. Improved Regularization of Convolutional Neural Networks with Cutout. 2017. arXiv: 1708.04552 [cs.CV].

[40] T. DeVries 和 G. W. Taylor。使用Cutout改进卷积神经网络的正则化。2017年。arXiv: 1708.04552 [cs.CV]。

[41] Y. Ding, Y. Liu, H. Luan, and M. Sun. "Visualizing and Understanding Neural Machine Translation". In: Annual Meeting of the Association for Computational Linguistics, ACL. 2017.

[41] Y. Ding, Y. Liu, H. Luan 和 M. Sun。“神经机器翻译的可视化与理解”。载于:计算语言学协会年会,ACL。2017年。

[42] Z. Ding and Y. Fu. "Deep Domain Generalization With Structured Low-Rank Constraint". In: IEEE Transactions on Image Processing 27.1 (2018), pp. 304-313.

[42] Z. Ding 和 Y. Fu。“具有结构化低秩约束的深度领域泛化”。载于:IEEE图像处理汇刊 27卷1期 (2018),第304-313页。

[43] C. Doersch, A. Gupta, and A. Zisserman. "CrossTransformers: spatially-aware few-shot transfer". In: Advances in Neural Information Processing Systems, NeurIPS. 2020.

[43] C. Doersch, A. Gupta 和 A. Zisserman。“CrossTransformers:空间感知的少样本迁移”。载于:神经信息处理系统大会,NeurIPS。2020年。

[44] Y. Dong, H. Su, J. Zhu, and B. Zhang. "Improving Interpretability of Deep Neural Networks with Semantic Information". In: Conference on Computer Vision and Pattern Recognition, CVPR. 2017.

[44] Y. Dong, H. Su, J. Zhu 和 B. Zhang。“利用语义信息提升深度神经网络的可解释性”。载于:计算机视觉与模式识别会议,CVPR。2017年。

[45] M. Donini, L. Oneto, S. Ben-David, J. Shawe-Taylor, and M. Pontil. "Empirical Risk Minimization Under Fairness Constraints". In: Advances in Neural Information Processing Systems, NeurIPS. 2018.

[45] M. Donini, L. Oneto, S. Ben-David, J. Shawe-Taylor 和 M. Pontil。“在公平性约束下的经验风险最小化”。载于:神经信息处理系统大会,NeurIPS。2018年。

[46] Q. Dou, D. C. de Castro, K. Kamnitsas, and B. Glocker. "Domain Generalization via Model-Agnostic Learning of Semantic Features". In: Advances in Neural Information Processing Systems, NeurIPS. 2019.

[46] Q. Dou, D. C. de Castro, K. Kamnitsas 和 B. Glocker。“通过模型无关的语义特征学习实现领域泛化”。载于:神经信息处理系统大会,NeurIPS。2019年。

[47] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. S. Zemel. "Fairness through awareness". In: Innovations in Theoretical Computer Science, ITCS. 2012.

[47] C. Dwork, M. Hardt, T. Pitassi, O. Reingold 和 R. S. Zemel。“通过意识实现公平”。载于:理论计算机科学创新会议,ITCS。2012年。

[48] C. Dwork, N. Immorlica, A. T. Kalai, and M. D. M. Leiserson. "Decoupled Classifiers for Group-Fair and Efficient Machine Learning". In: Conference on Fairness, Accountability and Transparency, FAT. 2018.

[48] C. Dwork, N. Immorlica, A. T. Kalai 和 M. D. M. Leiserson。“用于群体公平与高效机器学习的解耦分类器”。载于:公平性、问责制与透明度会议,FAT。2018年。

[49] M. Eitz, J. Hays, and M. Alexa. "How do humans sketch objects?" In: Transactions on Graphics, TOG 31.4 (2012), 44:1-44:10.

[49] M. Eitz, J. Hays 和 M. Alexa。“人类如何绘制物体草图?”载于:图形学汇刊,TOG 31卷4期 (2012),44:1-44:10。

[50] E. R. Elenberg, A. G. Dimakis, M. Feldman, and A. Karbasi. "Streaming Weak Submodularity: Interpreting Neural Networks on the Fly". In: Advances in Neural Information Processing Systems, NIPS. 2017.

[50] E. R. Elenberg, A. G. Dimakis, M. Feldman, 和 A. Karbasi. “流式弱子模性:即时解释神经网络”。载于:神经信息处理系统进展,NIPS。2017年。

[51] M. Everingham, L. V. Gool, C. K. I. Williams, J. M. Winn, and A. Zisserman. "The Pascal Visual Object Classes (VOC) Challenge". In: International Journal of Computer Vision, IJCV 88.2 (2010), pp. 303-338.

[51] M. Everingham, L. V. Gool, C. K. I. Williams, J. M. Winn, 和 A. Zisserman. “Pascal视觉对象类别(VOC)挑战”。载于:国际计算机视觉杂志,IJCV 88.2 (2010), 页303-338。

[52] C. Fang, Y. Xu, and D. N. Rockmore. "Unbiased Metric Learning: On the Utilization of Multiple Datasets and Web Images for Softening Bias". In: International Conference on Computer Vision, ICCV. 2013.

[52] C. Fang, Y. Xu, 和 D. N. Rockmore. “无偏度量学习:关于利用多个数据集和网络图像以减轻偏差”。载于:国际计算机视觉会议,ICCV。2013年。

[53] M. Feldman, S. A. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian. "Certifying and Removing Disparate Impact". In: International Conference on Knowledge Discovery and Data Mining, SIGKDD. 2015.

[53] M. Feldman, S. A. Friedler, J. Moeller, C. Scheidegger, 和 S. Venkatasubramanian. “认证与消除差异影响”。载于:知识发现与数据挖掘国际会议,SIGKDD。2015年。

[54] B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars. "Unsupervised Visual Domain Adaptation Using Subspace Alignment". In: International Conference on Computer Vision, ICCV. 2013.

[54] B. Fernando, A. Habrard, M. Sebban, 和 T. Tuytelaars. “基于子空间对齐的无监督视觉域适应”。载于:国际计算机视觉会议,ICCV。2013年。

[55] S. Fidler, S. J. Dickinson, and R. Urtasun. "3D Object Detection and Viewpoint Estimation with a Deformable 3D Cuboid Model". In: Advances in Neural Information Processing Systems, NIPS. 2012.

[55] S. Fidler, S. J. Dickinson, 和 R. Urtasun. “基于可变形三维立方体模型的三维物体检测与视角估计”。载于:神经信息处理系统进展,NIPS。2012年。

[56] C. Finn, P. Abbeel, and S. Levine. "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks". In: International Conference on Machine Learning, ICML. 2017.

[56] C. Finn, P. Abbeel, 和 S. Levine. “模型无关元学习(Model-Agnostic Meta-Learning)用于深度网络的快速适应”。载于:国际机器学习会议,ICML。2017年。

[57] R. C. Fong and A. Vedaldi. "Interpretable Explanations of Black Boxes by Meaningful Perturbation". In: International Conference on Computer Vision, ICCV. 2017.

[57] R. C. Fong 和 A. Vedaldi. “通过有意义的扰动对黑盒模型进行可解释性说明”。载于:国际计算机视觉会议,ICCV。2017年。

[58] N. Frosst and G. E. Hinton. "Distilling a Neural Network Into a Soft Decision Tree". In: International Workshop on Comprehensibility and Explanation in AI and ML, CEX. 2017.

[58] N. Frosst 和 G. E. Hinton. “将神经网络蒸馏为软决策树”。载于:人工智能与机器学习可解释性研讨会,CEX。2017年。

[59] F. B. Fuchs, O. Groth, A. R. Kosiorek, A. Bewley, M. Wulfmeier, A. Vedaldi, and I. Posner. "Scrutinizing and De-Biasing Intuitive Physics with Neural Stethoscopes". In: British Machine Vision Conference, BMVC. 2019.

[59] F. B. Fuchs, O. Groth, A. R. Kosiorek, A. Bewley, M. Wulfmeier, A. Vedaldi, 和 I. Posner. “利用神经听诊器审视并去偏直觉物理学”。载于:英国机器视觉会议,BMVC。2019年。

[60] Y. Ganin and V. S. Lempitsky. "Unsupervised Domain Adaptation by Backpropagation". In: International Conference on Machine Learning, ICML. 2015.

[60] Y. Ganin 和 V. S. Lempitsky. “通过反向传播实现无监督域适应”。载于:国际机器学习会议,ICML。2015年。

[61] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. S. Lempitsky. "Domain-Adversarial Training of Neural Networks". In: Journal of Machine Learning Research, JMLR 17 (2016), 59:1-59:35.

[61] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, 和 V. S. Lempitsky. “神经网络的域对抗训练”。载于:机器学习研究杂志,JMLR 17 (2016), 59:1-59:35。

[62] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel. "ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness". In: International Conference on Learning Representations, ICLR. 2019.

[62] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, 和 W. Brendel. “ImageNet训练的卷积神经网络偏向纹理;增强形状偏向提升准确性和鲁棒性”。载于:学习表征国际会议,ICLR。2019年。

[63] M. Gharib, P. Lollini, M. Botta, E. G. Amparore, S. Donatelli, and A. Bondavalli. "On the Safety of Automotive Systems Incorporating Machine Learning Based Components: A Position Paper". In: International Conference on Dependable Systems and Networks, DSN. 2018.

[63] M. Gharib, P. Lollini, M. Botta, E. G. Amparore, S. Donatelli, 和 A. Bondavalli. “关于包含基于机器学习组件的汽车系统安全性的立场论文”。载于:可靠系统与网络国际会议,DSN。2018年。

[64] G. Ghiasi, T. Lin, and Q. V. Le. "DropBlock: A regularization method for convolutional networks". In: Advances in Neural Information Processing Systems, NeurIPS. 2018.

[64] G. Ghiasi, T. Lin, 和 Q. V. Le. “DropBlock:卷积网络的正则化方法”。载于:神经信息处理系统进展,NeurIPS。2018年。

[65] M. Ghifary, D. Balduzzi, W. B. Kleijn, and M. Zhang. "Scatter Component Analysis: A Unified Framework for Domain Adaptation and Domain Generalization". In: Transactions on Pattern Analysis and Machine Intelligence, PAMI 39.7 (2017), pp. 1414-1430.

[65] M. Ghifary, D. Balduzzi, W. B. Kleijn, 和 M. Zhang. “散射成分分析:域适应与域泛化的统一框架”。载于:模式分析与机器智能汇刊,PAMI 39.7 (2017), 页1414-1430。

[66] M. Ghifary, W. B. Kleijn, M. Zhang, and D. Balduzzi. "Domain Generalization for Object Recognition with Multi-task Autoencoders". In: International Conference on Computer Vision, ICCV. 2015.

[66] M. Ghifary, W. B. Kleijn, M. Zhang, 和 D. Balduzzi. “基于多任务自编码器的目标识别领域泛化”。发表于:国际计算机视觉大会(ICCV),2015年。

[67] M. Ghifary, W. B. Kleijn, M. Zhang, D. Balduzzi, and W. Li. "Deep Reconstruction-Classification Networks for Unsupervised Domain Adaptation". In: European Conference on Computer Vision, ECCV. 2016.

[67] M. Ghifary, W. B. Kleijn, M. Zhang, D. Balduzzi, 和 W. Li. “用于无监督领域适应的深度重建-分类网络”。发表于:欧洲计算机视觉大会(ECCV),2016年。

[68] B. Gong, K. Grauman, and F. Sha. "Connecting the Dots with Landmarks: Discriminatively Learning Domain-Invariant Features for Unsupervised Domain Adaptation". In: International Conference on Machine Learning, ICML. 2013.

[68] B. Gong, K. Grauman, 和 F. Sha. “通过地标连接点:判别式学习无监督领域适应的领域不变特征”。发表于:国际机器学习大会(ICML),2013年。

[69] B. Gong, Y. Shi, F. Sha, and K. Grauman. "Geodesic flow kernel for unsupervised domain adaptation". In: Conference on Computer Vision and Pattern Recognition, CVPR. 2012.

[69] B. Gong, Y. Shi, F. Sha, 和 K. Grauman. “用于无监督领域适应的测地流核(Geodesic flow kernel)”。发表于:计算机视觉与模式识别会议(CVPR),2012年。

[70] P. Gordaliza, E. del Barrio, F. Gamboa, and J. Loubes. "Obtaining Fairness using Optimal Transport Theory". In: International Conference on Machine Learning, ICML. 2019.

[70] P. Gordaliza, E. del Barrio, F. Gamboa, 和 J. Loubes. “利用最优传输理论实现公平性”。发表于:国际机器学习大会(ICML),2019年。

[71] A. Graves, G. Wayne, and I. Danihelka. Neural Turing Machines. 2014. arXiv: 1410.5401 [cs.NE].

[71] A. Graves, G. Wayne, 和 I. Danihelka. 神经图灵机(Neural Turing Machines)。2014年。arXiv: 1410.5401 [cs.NE]。

[72] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. J. Smola. "A Kernel Two-Sample Test". In: Journal of Machine Learning Research, JMLR 13 (2012), pp. 723-773.

[72] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, 和 A. J. Smola. “核两样本检验”。发表于:机器学习研究杂志(JMLR)13卷(2012年),第723-773页。

[73] I. Gulrajani and D. Lopez-Paz. "In Search of Lost Domain Generalization". In: International Conference on Learning Representations, ICLR. 2021.

[73] I. Gulrajani 和 D. Lopez-Paz. “寻找失落的领域泛化”。发表于:国际学习表征会议(ICLR),2021年。

[74] Z. Guo, T. He, Z. Qin, Z. Xie, and J. Liu. "A Content-Based Recommendation Framework for Judicial Cases". In: International Conference of Pioneering Computer Scientists, Engineers and Educators, ICPCSEE. 2019.

[74] Z. Guo, T. He, Z. Qin, Z. Xie, 和 J. Liu. “基于内容的司法案例推荐框架”。发表于:先驱计算机科学家、工程师与教育者国际会议(ICPCSEE),2019年。

[75] M. Hardt, E. Price, and N. Srebro. "Equality of Opportunity in Supervised Learning". In: Advances in Neural Information Processing Systems, NIPS. 2016.

[75] M. Hardt, E. Price, 和 N. Srebro. “监督学习中的机会平等”。发表于:神经信息处理系统大会(NIPS),2016年。

[76] M. Harradon, J. Druce, and B. Ruttenberg. Causal Learning and Explanation of Deep Neural Networks via Autoencoded Activations. 2018. arXiv: 1802.00541 [cs.AI].

[76] M. Harradon, J. Druce, 和 B. Ruttenberg. 通过自编码激活实现深度神经网络的因果学习与解释。2018年。arXiv: 1802.00541 [cs.AI]。

[77] K. He, X. Zhang, S. Ren, and J. Sun. "Deep Residual Learning for Image Recognition". In: Conference on Computer Vision and Pattern Recognition, CVPR. 2016.

[77] K. He, X. Zhang, S. Ren, 和 J. Sun. “用于图像识别的深度残差学习”。发表于:计算机视觉与模式识别会议(CVPR),2016年。

[78] H. Heidari, C. Ferrari, K. P. Gummadi, and A. Krause. "Fairness Behind a Veil of Ignorance: A Welfare Analysis for Automated Decision Making". In: Advances in Neural Information Processing Systems, NeurIPS. 2018.

[78] H. Heidari, C. Ferrari, K. P. Gummadi, 和 A. Krause. “无知之幕下的公平性:自动决策的福利分析”。发表于:神经信息处理系统大会(NeurIPS),2018年。

[79] L. A. Hendricks, Z. Akata, M. Rohrbach, J. Donahue, B. Schiele, and T. Darrell. "Generating Visual Explanations". In: European Conference on Computer Vision, ECCV. 2016.

[79] L. A. Hendricks, Z. Akata, M. Rohrbach, J. Donahue, B. Schiele, 和 T. Darrell. “生成视觉解释”。发表于:欧洲计算机视觉大会(ECCV),2016年。

[80] D. Hendrycks and T. G. Dietterich. "Benchmarking Neural Network Robustness to Common Corruptions and Perturbations". In: International Conference on Learning Representations, ICLR. 2019.

[80] D. Hendrycks 和 T. G. Dietterich. “神经网络对常见损坏和扰动的鲁棒性基准测试”。发表于:国际学习表征会议(ICLR),2019年。

[81] M. Hind, D. Wei, M. Campbell, N. C. F. Codella, A. Dhurandhar, A. Mojsilovic, K. N. Ramamurthy, and K. R. Varshney. "TED: Teaching AI to Explain its Decisions". In: Conference on AI, Ethics, and Society, AIES. 2019.

[81] M. Hind, D. Wei, M. Campbell, N. C. F. Codella, A. Dhurandhar, A. Mojsilovic, K. N. Ramamurthy, 和 K. R. Varshney. “TED:教人工智能解释其决策”。发表于:人工智能、伦理与社会会议(AIES),2019年。

[82] B. Hou and Z. Zhou. "Learning With Interpretable Structure From Gated RNN". In: Transactions on Neural Networks and Learning Systems 31.7 (2020), pp. 2267-2279.

[82] B. Hou 和 Z. Zhou. “从门控循环神经网络(Gated RNN)中学习可解释结构”。发表于:《神经网络与学习系统汇刊》(Transactions on Neural Networks and Learning Systems)31卷7期(2020年),第2267-2279页。

[83] S. Hu, K. Zhang, Z. Chen, and L. Chan. "Domain Generalization via Multidomain Discriminant Analysis". In: Conference on Uncertainty in Artificial Intelligence, UAI. 2019.

[84] W. Hu, G. Niu, I. Sato, and M. Sugiyama. "Does Distributionally Robust Supervised Learning Give Robust Classifiers?" In: International Conference on Machine Learning, ICML. 2018.

[85] J. Huang, A. J. Smola, A. Gretton, K. M. Borgwardt, and B. Schölkopf. "Correcting Sample Selection Bias by Unlabeled Data". In: Advances in Neural Information Processing Systems, NIPS. 2006.

[86] Z. Huang, H. Wang, E. P. Xing, and D. Huang. "Self-Challenging Improves Cross-Domain Generalization". In: European Conference on Computer Vision, ECCV. 2020.

[87] D. A. Hudson and C. D. Manning. "Compositional Attention Networks for Machine Reasoning". In: International Conference on Learning Representations, ICLR. 2018.

[88] M. Ilse, J. M. Tomczak, C. Louizos, and M. Welling. "DIVA: Domain Invariant Variational Autoencoders". In: International Conference on Learning Representations, ICLR. 2019.

[89] S. Ioffe and C. Szegedy. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift". In: International Conference on Machine Learning, ICML. 2015.

[90] R. Iyer, Y. Li, H. Li, M. Lewis, R. Sundar, and K. P. Sycara. "Transparency and Explanation in Deep Reinforcement Learning Neural Networks". In: Conference on AI, Ethics, and Society, AIES. 2018.

[91] S. Jain and B. C. Wallace. "Attention is not Explanation". In: Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT. 2019.

[92] Y. Jia, J. Zhang, S. Shan, and X. Chen. "Single-Side Domain Generalization for Face Anti-Spoofing". In: Conference on Computer Vision and Pattern Recognition, CVPR. 2020.

[93] H. Jiang, B. Kim, M. Y. Guan, and M. R. Gupta. "To Trust Or Not To Trust A Classifier". In: Advances in Neural Information Processing Systems, NeurIPS. 2018.

[94] X. Jin, C. Lan, W. Zeng, and Z. Chen. Feature Alignment and Restoration for Domain Generalization and Adaptation. 2020. arXiv: 2006.12009 [cs.CV].

[95] D. Kang, D. Raghavan, P. Bailis, and M. Zaharia. "Model Assertions for Monitoring and Improving ML Models". In: Conference on Machine Learning and Systems, MLSys. 2020.

[96] A. Khosla, T. Zhou, T. Malisiewicz, A. A. Efros, and A. Torralba. "Undoing the Damage of Dataset Bias". In: European Conference on Computer Vision, ECCV. 2012.

[97] B. Kim, C. Rudin, and J. A. Shah. "The Bayesian Case Model: A Generative Approach for Case-Based Reasoning and Prototype Classification". In: Advances in Neural Information Processing Systems, NIPS. 2014.

[98] D. P. Kingma and J. L. Ba. "Adam: A Method for Stochastic Optimization". In: International Conference on Learning Representations, ICLR. 2015.

[99] D. P. Kingma and M. Welling. "Auto-Encoding Variational Bayes". In: International Conference on Learning Representations, ICLR. 2014.

[100] A. Krizhevsky, I. Sutskever, and G. E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks". In: Advances in Neural Information Processing Systems, NIPS. 2012.

[101] D. Krueger, E. Caballero, J.-H. Jacobsen, A. Zhang, J. Binas, R. L. Priol, and A. Courville. Out-of-Distribution Generalization via Risk Extrapolation (REx). 2020. arXiv: 2003.00688 [cs.LG].

[102] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut. "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations". In: International Conference on Learning Representations, ICLR. 2019.

[103] S. Lapuschkin, A. Binder, G. Montavon, K. Müller, and W. Samek. "Analyzing Classifiers: Fisher Vectors and Deep Neural Networks". In: Conference on Computer Vision and Pattern Recognition, CVPR. 2016.

[104] G. Larsson, M. Maire, and G. Shakhnarovich. "FractalNet: Ultra-Deep Neural Networks without Residuals". In: International Conference on Learning Representations, ICLR. 2017.

[105] Y. LeCun and C. Cortes. MNIST handwritten digit database. 2010.

[106] T. Lei, R. Barzilay, and T. S. Jaakkola. "Rationalizing Neural Predictions". In: Conference on Empirical Methods in Natural Language Processing, EMNLP. 2016.

[107] D. Li, Y. Yang, Y. Song, and T. M. Hospedales. "Deeper, Broader and Artier Domain Generalization". In: International Conference on Computer Vision, ICCV. 2017.

[108] D. Li, Y. Yang, Y. Song, and T. M. Hospedales. "Learning to Generalize: Meta-Learning for Domain Generalization". In: AAAI Conference on Artificial Intelligence, AAAI. 2018.

[109] D. Li, Y. Yang, Y.-Z. Song, and T. Hospedales. "Sequential Learning for Domain Generalization". In: European Conference on Computer Vision, ECCV. 2020.

[110] D. Li, J. Zhang, Y. Yang, C. Liu, Y. Song, and T. M. Hospedales. "Episodic Training for Domain Generalization". In: International Conference on Computer Vision, ICCV. 2019.

[111] F. Li, R. Fergus, and P. Perona. "Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories". In: Computer Vision and Image Understanding 106.1 (2007), pp. 59-70.

[112] H. Li, S. J. Pan, S. Wang, and A. C. Kot. "Domain Generalization With Adversarial Feature Learning". In: Conference on Computer Vision and Pattern Recognition, CVPR. 2018.

[113] J. Li, W. Monroe, and D. Jurafsky. Understanding Neural Networks through Representation Erasure. 2016. arXiv: 1612.08220 [cs.CL].

[114] O. Li, H. Liu, C. Chen, and C. Rudin. "Deep Learning for Case-Based Reasoning Through Prototypes: A Neural Network That Explains Its Predictions". In: AAAI Conference on Artificial Intelligence, AAAI. 2018.

[115] Y. Li, M. Gong, X. Tian, T. Liu, and D. Tao. "Domain Generalization via Conditional Invariant Representations". In: AAAI Conference on Artificial Intelligence, AAAI. 2018.

[116] Y. Li, X. Tian, M. Gong, Y. Liu, T. Liu, K. Zhang, and D. Tao. "Deep Domain Generalization via Conditional Invariant Adversarial Networks". In: European Conference on Computer Vision, ECCV. 2018.

[117] Y. Li, Y. Yang, W. Zhou, and T. M. Hospedales. "Feature-Critic Networks for Heterogeneous Domain Generalization". In: International Conference on Machine Learning, ICML. 2019.

[118] M. Lin, Q. Chen, and S. Yan. "Network In Network". In: International Conference on Learning Representations, ICLR. 2014.

[119] M. Long, G. Ding, J. Wang, J. Sun, Y. Guo, and P. S. Yu. "Transfer Sparse Coding for Robust Image Representation". In: Conference on Computer Vision and Pattern Recognition, CVPR. 2013.

[120] C. Louizos, K. Swersky, Y. Li, M. Welling, and R. S. Zemel. "The Variational Fair Autoencoder". In: International Conference on Learning Representations, ICLR. 2016.

[121] T. Luong, H. Pham, and C. D. Manning. "Effective Approaches to Attention-based Neural Machine Translation". In: Conference on Empirical Methods in Natural Language Processing, EMNLP. 2015.

[122] D. Mahajan, S. Tople, and A. Sharma. Domain Generalization using Causal Matching. 2020. arXiv: 2006.07500 [cs.LG].

[123] M. Mancini, Z. Akata, E. Ricci, and B. Caputo. "Towards Recognizing Unseen Categories in Unseen Domains". In: European Conference on Computer Vision, ECCV. 2020.

[124] M. Mancini, S. R. Bulò, B. Caputo, and E. Ricci. "Best Sources Forward: Domain Generalization through Source-Specific Nets". In: International Conference on Image Processing, ICIP. 2018.

[125] M. Mancini, S. R. Bulò, B. Caputo, and E. Ricci. "Robust Place Categorization With Deep Domain Generalization". In: IEEE Robotics and Automation Letters 3.3 (2018), pp. 2093-2100.

[126] M. Mancini, L. Porzi, S. R. Bulò, B. Caputo, and E. Ricci. "Boosting Domain Adaptation by Discovering Latent Domains". In: Conference on Computer Vision and Pattern Recognition, CVPR. 2018.

[127] T. Matsuura and T. Harada. "Domain Generalization Using a Mixture of Multiple Latent Domains". In: AAAI Conference on Artificial Intelligence, AAAI. 2020.

[128] C. Molnar. Interpretable Machine Learning. A Guide for Making Black Box Models Explainable. https://christophm.github.io/interpretable-ml-book/. 2019.

[129] G. Montavon, S. Lapuschkin, A. Binder, W. Samek, and K. Müller. "Explaining nonlinear classification decisions with deep Taylor decomposition". In: Pattern Recognition 65 (2017), pp. 211-222.

[130] P. Morerio, J. Cavazza, R. Volpi, R. Vidal, and V. Murino. "Curriculum Dropout". In: International Conference on Computer Vision, ICCV. 2017.

[131] S. Motiian, M. Piccirilli, D. A. Adjeroh, and G. Doretto. "Unified Deep Supervised Domain Adaptation and Generalization". In: International Conference on Computer Vision, ICCV. 2017.

[132] K. Muandet, D. Balduzzi, and B. Schölkopf. "Domain Generalization via Invariant Feature Representation". In: International Conference on Machine Learning, ICML. 2013.

[133] K. Muandet, K. Fukumizu, B. K. Sriperumbudur, and B. Schölkopf. "Kernel Mean Embedding of Distributions: A Review and Beyond". In: Foundations and Trends in Machine Learning 10.1-2 (2017), pp. 1-141.

[134] W. J. Murdoch and A. Szlam. "Automatic Rule Extraction from Long Short Term Memory Networks". In: International Conference on Learning Representations, ICLR. 2017.

[135] H. Nam, H. Lee, J. Park, W. Yoon, and D. Yoo. Reducing Domain Gap via Style-Agnostic Networks. 2019. arXiv: 1910.11645 [cs.CV].

[136] S. J. Nowlan and G. E. Hinton. "Simplifying Neural Networks by Soft Weight-Sharing". In: Neural Computation 4.4 (1992), pp. 473-493.

[137] O. Nuriel, S. Benaim, and L. Wolf. Permuted AdaIN: Enhancing the Representation of Local Cues in Image Classifiers. 2020. arXiv: 2010.05785 [cs.CV].

[138] D. H. Park, L. A. Hendricks, Z. Akata, A. Rohrbach, B. Schiele, T. Darrell, and M. Rohrbach. "Multimodal Explanations: Justifying Decisions and Pointing to the Evidence". In: Conference on Computer Vision and Pattern Recognition, CVPR. 2018.

[139] S. Park and N. Kwak. "Analysis on the Dropout Effect in Convolutional Neural Networks". In: Asian Conference on Computer Vision, ACCV. 2016.

[140] F. Pasa, V. Golkov, F. Pfeiffer, D. Cremers, and D. Pfeiffer. "Efficient Deep Network Architectures for Fast Chest X-Ray Tuberculosis Screening and Visualization". In: Scientific Reports 9.1 (2019).

[141] X. Peng, Q. Bai, X. Xia, Z. Huang, K. Saenko, and B. Wang. "Moment Matching for Multi-Source Domain Adaptation". In: International Conference on Computer Vision, ICCV. 2019.

[142] C. S. Perone, P. L. Ballester, R. C. Barros, and J. Cohen-Adad. "Unsupervised domain adaptation for medical imaging segmentation with self-ensembling". In: NeuroImage 194 (2019), pp. 1-11.

[143] J. Peters, P. Bühlmann, and N. Meinshausen. "Causal inference using invariant prediction: identification and confidence intervals". In: Journal of the Royal Statistical Society, Series B (Statistical Methodology) 78.5 (2016), pp. 947-1012.

[144] F. du Pin Calmon, D. Wei, B. Vinzamuri, K. N. Ramamurthy, and K. R. Varshney. "Optimized Pre-Processing for Discrimination Prevention". In: Advances in Neural Information Processing Systems, NIPS. 2017.

[145] V. Piratla, P. Netrapalli, and S. Sarawagi. "Efficient Domain Generalization via Common-Specific Low-Rank Decomposition". In: International Conference on Machine Learning, ICML. 2020.

[146] G. Pleiss, M. Raghavan, F. Wu, J. M. Kleinberg, and K. Q. Weinberger. "On Fairness and Calibration". In: Advances in Neural Information Processing Systems, NIPS. 2017.

[147] C. E. Priebe, D. J. Marchette, J. DeVinney, and D. A. Socolinsky. "Classification Using Class Cover Catch Digraphs". In: Journal of Classification 20.1 (2003), pp. 3-23.

[148] F. Qiao, L. Zhao, and X. Peng. "Learning to Learn Single Domain Generalization". In: Conference on Computer Vision and Pattern Recognition, CVPR. 2020.

[149] C. Qin, H. Zhu, T. Xu, C. Zhu, L. Jiang, E. Chen, and H. Xiong. "Enhancing Person-Job Fit for Talent Recruitment: An Ability-aware Neural Network Approach". In: SIGIR Conference on Research & Development in Information Retrieval, SIGIR. 2018.

[150] M. M. Rahman, C. Fookes, M. Baktashmotlagh, and S. Sridharan. "Correlation-aware adversarial domain adaptation and generalization". In: Pattern Recognition 100 (2020), p. 107124.

[151] M. M. Rahman, C. Fookes, M. Baktashmotlagh, and S. Sridharan. "Multi-Component Image Translation for Deep Domain Generalization". In: Winter Conference on Applications of Computer Vision, WACV. 2019.

[152] H. Ramsauer, B. Schäfl, J. Lehner, P. Seidl, M. Widrich, L. Gruber, M. Holzleitner, M. Pavlović, G. K. Sandve, V. Greiff, D. Kreil, M. Kopp, G. Klambauer, J. Brandstetter, and S. Hochreiter. "Hopfield Networks is All You Need". In: International Conference on Learning Representations, ICLR. 2021.

[153] M. T. Ribeiro, S. Singh, and C. Guestrin. ""Why Should I Trust You?": Explaining the Predictions of Any Classifier". In: International Conference on Knowledge Discovery and Data Mining, SIGKDD. 2016.

[154] M. T. Ribeiro, S. Singh, and C. Guestrin. "Anchors: High-Precision Model-Agnostic Explanations". In: AAAI Conference on Artificial Intelligence, AAAI. 2018.

[155] H. Robbins and S. Monro. "A Stochastic Approximation Method". In: The Annals of Mathematical Statistics 22.3 (1951), pp. 400-407.

[156] M. Robnik-Sikonja and I. Kononenko. "Explaining Classifications For Individual Instances". In: IEEE Transactions on Knowledge and Data Engineering 20.5 (2008), pp. 589-600.

[157] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li. "ImageNet Large Scale Visual Recognition Challenge". In: International Journal of Computer Vision, IJCV 115.3 (2015), pp. 211-252.

[158] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. "LabelMe: A Database and Web-Based Tool for Image Annotation". In: International Journal of Computer Vision, IJCV 77.1-3 (2008), pp. 157-173.

[159] S. Sagawa, P. W. Koh, T. B. Hashimoto, and P. Liang. "Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization". In: International Conference on Learning Representations, ICLR. 2020.

[160] P. Sangkloy, N. Burnell, C. Ham, and J. Hays. "The sketchy database: learning to retrieve badly drawn bunnies". In: Transactions on Graphics, TOG 35.4 (2016), 119:1-119:12.

[161] J. Schmidhuber, J. Zhao, and M. A. Wiering. "Shifting Inductive Bias with Success-Story Algorithm, Adaptive Levin Search, and Incremental Self-Improvement". In: Machine Learning 28.1 (1997), pp. 105-130.

[162] R. M. Schmidt. Recurrent Neural Networks (RNNs): A gentle Introduction and Overview. 2019. arXiv: 1912.05911 [cs.LG].

[163] R. M. Schmidt, F. Schneider, and P. Hennig. Descending through a Crowded Valley - Benchmarking Deep Learning Optimizers. 2020. arXiv: 2007.01547 [cs.LG].

[164] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. "Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization". In: International Conference on Computer Vision, ICCV. 2017.

[165] C. Sen, T. Hartvigsen, B. Yin, X. Kong, and E. A. Rundensteiner. "Human Attention Maps for Text Classification: Do Humans and Neural Networks Focus on the Same Words?" In: Annual Meeting of the Association for Computational Linguistics, ACL. 2020.

[166] S. Seo, Y. Suh, D. Kim, G. Kim, J. Han, and B. Han. "Learning to Optimize Domain Specific Normalization for Domain Generalization". In: European Conference on Computer Vision, ECCV. 2020.

[167] S. Shankar, V. Piratla, S. Chakrabarti, S. Chaudhuri, P. Jyothi, and S. Sarawagi. "Generalizing Across Domains via Cross-Gradient Training". In: International Conference on Learning Representations, ICLR. 2018.

[168] A. Shrikumar, P. Greenside, and A. Kundaje. "Learning Important Features Through Propagating Activation Differences". In: International Conference on Machine Learning, ICML. 2017.

[169] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. "Learning from Simulated and Unsupervised Images through Adversarial Training". In: Conference on Computer Vision and Pattern Recognition, CVPR. 2017.

[170] K. Simonyan, A. Vedaldi, and A. Zisserman. "Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps". In: International Conference on Learning Representations, ICLR. 2014.

[171] K. Simonyan and A. Zisserman. "Very Deep Convolutional Networks for Large-Scale Image Recognition". In: International Conference on Learning Representations, ICLR. 2015.

[172] K. K. Singh and Y. J. Lee. "Hide-and-Seek: Forcing a Network to be Meticulous for Weakly-Supervised Object and Action Localization". In: International Conference on Computer Vision, ICCV. 2017.

[173] J. Snell, K. Swersky, and R. S. Zemel. "Prototypical Networks for Few-shot Learning". In: Advances in Neural Information Processing Systems, NIPS. 2017.

[174] N. Somavarapu, C.-Y. Ma, and Z. Kira. Frustratingly Simple Domain Generalization via Image Stylization. 2020. arXiv: 2006.11207 [cs.CV].

[175] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. A. Riedmiller. "Striving for Simplicity: The All Convolutional Net". In: International Conference on Learning Representations, ICLR. 2015.

[176] B. K. Sriperumbudur, K. Fukumizu, A. Gretton, G. R. G. Lanckriet, and B. Schölkopf. "Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions". In: Advances in Neural Information Processing Systems, NIPS. 2009.

[177] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. "Dropout: a simple way to prevent neural networks from overfitting". In: Journal of Machine Learning Research, JMLR 15.1 (2014), pp. 1929-1958.

[178] P. Stock and M. Cissé. "ConvNets and ImageNet Beyond Accuracy: Understanding Mistakes and Uncovering Biases". In: European Conference on Computer Vision, ECCV. 2018.

[179] B. Sun and K. Saenko. "Deep CORAL: Correlation Alignment for Deep Domain Adaptation". In: European Conference on Computer Vision, ECCV. 2016.

[180] G. Sun, S. Khan, W. Li, H. Cholakkal, F. Khan, and L. V. Gool. "Fixing Localization Errors to Improve Image Classification". In: European Conference on Computer Vision, ECCV. 2020.

[181] M. Sundararajan, A. Taly, and Q. Yan. "Axiomatic Attribution for Deep Networks". In: International Conference on Machine Learning, ICML. 2017.

[182] S. Tan, R. Caruana, G. Hooker, P. Koch, and A. Gordo. Learning Global Additive Explanations for Neural Nets Using Model Distillation. 2018. arXiv: 1801.08640 [stat.ML].

[183] S. Thrun and L. Y. Pratt. Learning to Learn. Springer, 1998.

[184] I. O. Tolstikhin, O. Bousquet, S. Gelly, and B. Schölkopf. "Wasserstein Auto-Encoders". In: International Conference on Learning Representations, ICLR. 2018.

[185] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler. "Efficient object localization using Convolutional Networks". In: Conference on Computer Vision and Pattern Recognition, CVPR. 2015.

[186] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. "Adversarial Discriminative Domain Adaptation". In: Conference on Computer Vision and Pattern Recognition, CVPR. 2017.

[187] V. Vapnik. Statistical Learning Theory. Wiley, 1998.

[188] K. R. Varshney and H. Alemzadeh. "On the Safety of Machine Learning: Cyber-Physical Systems, Decision Sciences, and Data Products". In: Big Data 5.3 (2017), pp. 246-255.

[189] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. "Attention is All you Need". In: Advances in Neural Information Processing Systems, NIPS. 2017.

[190] H. Venkateswara, J. Eusebio, S. Chakraborty, and S. Panchanathan. "Deep Hashing Network for Unsupervised Domain Adaptation". In: Conference on Computer Vision and Pattern Recognition, CVPR. 2017.

[191] G. Volk, S. Müller, A. von Bernuth, D. Hospach, and O. Bringmann. "Towards Robust CNN-based Object Detection through Augmentation with Synthetic Rain Variations". In: International Conference on Intelligent Transportation Systems Conference, ITSC. 2019.

[192] R. Volpi and V. Murino. "Addressing Model Vulnerability to Distributional Shifts Over Image Transformation Sets". In: International Conference on Computer Vision, ICCV. 2019.

[193] R. Volpi, H. Namkoong, O. Sener, J. C. Duchi, V. Murino, and S. Savarese. "Generalizing to Unseen Domains via Adversarial Data Augmentation". In: Advances in Neural Information Processing Systems, NeurIPS. 2018.

[194] H. Wang, Z. He, Z. C. Lipton, and E. P. Xing. "Learning Robust Representations by Projecting Superficial Statistics Out". In: International Conference on Learning Representations, ICLR. 2019.

[195] Y. Wang, H. Li, and A. C. Kot. "Heterogeneous Domain Generalization Via Domain Mixup". In: International Conference on Acoustics, Speech and Signal Processing, ICASSP. 2020.

[196] S. Wiegreffe and Y. Pinter. "Attention is not not Explanation". In: Conference on Empirical Methods in Natural Language Processing, EMNLP. 2019.

[197] B. E. Woodworth, S. Gunasekar, M. I. Ohannessian, and N. Srebro. "Learning Non-Discriminatory Predictors". In: Conference on Learning Theory, COLT. 2017.

[198] C. Wu and E. G. Tabak. Prototypal Analysis and Prototypal Regression. 2017. arXiv: 1701.08916 [stat.ML].

[199] N. Xie, G. Ras, M. van Gerven, and D. Doran. Explainable Deep Learning: A Field Guide for the Uninitiated. 2020. arXiv: 2004.14545 [cs.LG].

[200] M. Xu, J. Zhang, B. Ni, T. Li, C. Wang, Q. Tian, and W. Zhang. "Adversarial Domain Adaptation with Domain Mixup". In: AAAI Conference on Artificial Intelligence, AAAI. 2020.

[201] W. Xu, Y. Xian, J. Wang, B. Schiele, and Z. Akata. "Attribute Prototype Network for Zero-Shot Learning". In: Advances in Neural Information Processing Systems, NeurIPS. 2020.

[202] M. Yamada, L. Sigal, and M. Raptis. "No Bias Left behind: Covariate Shift Adaptation for Discriminative 3D Pose Estimation". In: European Conference on Computer Vision, ECCV. 2012.

[203] S. Yan, H. Song, N. Li, L. Zou, and L. Ren. Improve Unsupervised Domain Adaptation with Mixup Training. 2020. arXiv: 2001.00677 [stat.ML].

[204] M. B. Zafar, I. Valera, M. Gomez-Rodriguez, and K. P. Gummadi. "Fairness Beyond Disparate Treatment & Disparate Impact: Learning Classification without Disparate Mistreatment". In: International Conference on World Wide Web, WWW. 2017.

[205] S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, M. Yang, and L. Shao. "CycleISP: Real Image Restoration via Improved Data Synthesis". In: Conference on Computer Vision and Pattern Recognition, CVPR. 2020.

[206] M. D. Zeiler and R. Fergus. "Visualizing and Understanding Convolutional Networks". In: European Conference on Computer Vision, ECCV. 2014.

[207] R. Zellers, Y. Bisk, A. Farhadi, and Y. Choi. "From Recognition to Cognition: Visual Commonsense Reasoning". In: Conference on Computer Vision and Pattern Recognition, CVPR. 2019.

[208] R. S. Zemel, Y. Wu, K. Swersky, T. Pitassi, and C. Dwork. "Learning Fair Representations". In: International Conference on Machine Learning, ICML. 2013.

[209] X. Zeng, W. Ouyang, M. Wang, and X. Wang. "Deep Learning of Scene-Specific Classifier for Pedestrian Detection". In: European Conference on Computer Vision, ECCV. 2014.

[210] H. Zhang, M. Cissé, Y. N. Dauphin, and D. Lopez-Paz. "mixup: Beyond Empirical Risk Minimization". In: International Conference on Learning Representations, ICLR. 2018.

[211] L. Zhang, X. Wang, D. Yang, T. Sanford, S. Harmon, B. Turkbey, H. Roth, A. Myronenko, D. Xu, and Z. Xu. When Unseen Domain Generalization is Unnecessary? Rethinking Data Augmentation. 2019. arXiv: 1906.03347 [cs.CV].

[212] M. Zhang, H. Marklund, N. Dhawan, A. Gupta, S. Levine, and C. Finn. Adaptive Risk Minimization: A Meta-Learning Approach for Tackling Group Shift. 2020. arXiv: 2007.02931 [cs.LG].

[213] Q. Zhang, R. Cao, F. Shi, Y. N. Wu, and S. Zhu. "Interpreting CNN Knowledge via an Explanatory Graph". In: AAAI Conference on Artificial Intelligence, AAAI. 2018.

[214] Q. Zhang, R. Cao, Y. N. Wu, and S. Zhu. "Growing Interpretable Part Graphs on ConvNets via Multi-Shot Learning". In: AAAI Conference on Artificial Intelligence, AAAI. 2017.

[215] Q. Zhang, Y. Yang, H. Ma, and Y. N. Wu. "Interpreting CNNs via Decision Trees". In: Conference on Computer Vision and Pattern Recognition, CVPR. 2019.

[216] Y. Zhao, M. K. Hryniewicki, F. Cheng, B. Fu, and X. Zhu. "Employee Turnover Prediction with Machine Learning: A Reliable Approach". In: Intelligent Systems Conference, IntelliSys. 2018.

[217] B. Zhou, A. Khosla, À. Lapedriza, A. Oliva, and A. Torralba. "Learning Deep Features for Discriminative Localization". In: Conference on Computer Vision and Pattern Recognition, CVPR. 2016.

[218] K. Zhou, Y. Yang, T. Hospedales, and T. Xiang. "Learning to Generate Novel Domains for Domain Generalization". In: European Conference on Computer Vision, ECCV. 2020.

[219] K. Zhou, Y. Yang, T. M. Hospedales, and T. Xiang. "Deep Domain-Adversarial Image Generation for Domain Generalisation". In: AAAI Conference on Artificial Intelligence, AAAI. 2020.

[220] L. M. Zintgraf, T. S. Cohen, T. Adel, and M. Welling. "Visualizing Deep Neural Network Decisions: Prediction Difference Analysis". In: International Conference on Learning Representations, ICLR. 2017.

[221] A. Zunino, S. A. Bargal, R. Volpi, M. Sameki, J. Zhang, S. Sclaroff, V. Murino, and K. Saenko. Explainable Deep Classification Models for Domain Generalization. 2020. arXiv: 2003.06498 [cs.CV].

Appendix A

Domain-specific results

Since Table 5.2 only shows the average performance for each dataset, we provide the full experimental results here for each individual domain across datasets. The results for all algorithms other than RSC, DivCAM, and D-TRANSFORMERS are taken from DOMAINBED.
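As a sanity check when reading the tables below, each Avg. column is the unweighted mean of the four per-domain accuracies. A minimal, illustrative helper (not from the thesis codebase) reproduces it:

```python
def dataset_average(domain_accuracies):
    """Unweighted mean over held-out domains, rounded to one decimal
    as reported in the Avg. column of Tables A.1-A.4."""
    return round(sum(domain_accuracies) / len(domain_accuracies), 1)

# ERM on VLCS (Table A.1): per-domain means for C, L, S, V
print(dataset_average([97.7, 64.3, 73.4, 74.6]))  # 77.5
```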

Algorithm | C | L | S | V | Avg.
ERM | 97.7 ± 0.4 | 64.3 ± 0.9 | 73.4 ± 0.5 | 74.6 ± 1.3 | 77.5
IRM | 98.6 ± 0.1 | 64.9 ± 0.9 | 73.4 ± 0.6 | 77.3 ± 0.9 | 78.5
GroupDRO | 97.3 ± 0.3 | 63.4 ± 0.9 | 69.5 ± 0.8 | 76.7 ± 0.7 | 76.7
Mixup | 98.3 ± 0.6 | 64.8 ± 1.0 | 72.1 ± 0.5 | 74.3 ± 0.8 | 77.4
MLDG | 97.4 ± 0.2 | 65.2 ± 0.7 | 71.0 ± 1.4 | 75.3 ± 1.0 | 77.2
CORAL | 98.3 ± 0.1 | 66.1 ± 1.2 | 73.4 ± 0.3 | 77.5 ± 1.2 | 78.8
MMD | 97.7 ± 0.1 | 64.0 ± 1.1 | 72.8 ± 0.2 | 75.3 ± 3.3 | 77.5
DANN | 99.0 ± 0.3 | 65.1 ± 1.4 | 73.1 ± 0.3 | 77.2 ± 0.6 | 78.6
CDANN | 97.1 ± 0.3 | 65.1 ± 1.2 | 70.7 ± 0.8 | 77.1 ± 1.5 | 77.5
MTL | 97.8 ± 0.4 | 64.3 ± 0.3 | 71.5 ± 0.7 | 75.3 ± 1.7 | 77.2
SagNet | 97.9 ± 0.4 | 64.5 ± 0.5 | 71.4 ± 1.3 | 77.5 ± 0.5 | 77.8
ARM | 98.7 ± 0.2 | 63.6 ± 0.7 | 71.3 ± 1.2 | 76.7 ± 0.6 | 77.6
VREx | 98.4 ± 0.3 | 64.4 ± 1.4 | 74.1 ± 0.4 | 76.2 ± 1.3 | 78.3
RSC | 97.9 ± 0.1 | 62.5 ± 0.7 | 72.3 ± 1.2 | 75.6 ± 0.8 | 77.1
DivCAM-S | 98.7 ± 0.1 | 64.5 ± 1.1 | 72.5 ± 0.7 | 75.5 ± 0.4 | 77.8
D-TRANSFORMERS | 98.1 ± 0.2 | 65.8 ± 0.6 | 71.7 ± 0.4 | 79.2 ± 1.3 | 78.7
ERM* | 97.6 ± 0.3 | 67.9 ± 0.7 | 70.9 ± 0.2 | 74.0 ± 0.6 | 77.6
IRM* | 97.3 ± 0.2 | 66.7 ± 0.1 | 71.0 ± 2.3 | 72.8 ± 0.4 | 76.9
GroupDRO* | 97.7 ± 0.2 | 65.9 ± 0.2 | 72.8 ± 0.8 | 73.4 ± 1.3 | 77.4
Mixup* | 97.8 ± 0.4 | 67.2 ± 0.4 | 71.5 ± 0.2 | 75.7 ± 0.6 | 78.1
MLDG* | 97.1 ± 0.5 | 66.6 ± 0.5 | 71.5 ± 0.1 | 75.0 ± 0.9 | 77.5
CORAL* | 97.3 ± 0.2 | 67.5 ± 0.6 | 71.6 ± 0.6 | 74.5 ± 0.0 | 77.7
MMD* | 98.8 ± 0.0 | 66.4 ± 0.4 | 70.8 ± 0.5 | 75.6 ± 0.4 | 77.9
DANN* | 99.0 ± 0.2 | 66.3 ± 1.2 | 73.4 ± 1.4 | 80.1 ± 0.5 | 79.7
CDANN* | 98.2 ± 0.1 | 68.8 ± 0.5 | 74.3 ± 0.6 | 78.1 ± 0.5 | 79.9
MTL* | 97.9 ± 0.7 | 66.1 ± 0.7 | 72.0 ± 0.4 | 74.9 ± 1.1 | 77.7
SagNet* | 97.4 ± 0.3 | 66.4 ± 0.4 | 71.6 ± 0.1 | 75.0 ± 0.8 | 77.6
ARM* | 97.6 ± 0.6 | 66.5 ± 0.3 | 72.7 ± 0.6 | 74.4 ± 0.7 | 77.8
VREx* | 98.4 ± 0.2 | 66.4 ± 0.7 | 72.8 ± 0.1 | 75.0 ± 1.4 | 78.1
RSC* | 98.0 ± 0.4 | 67.2 ± 0.3 | 70.3 ± 1.3 | 75.6 ± 0.4 | 77.8
DivCAM-S* | 98.0 ± 0.5 | 66.1 ± 0.3 | 72.0 ± 1.0 | 76.4 ± 0.7 | 78.1
D-TRANSFORMERS* | 98.1 ± 0.3 | 66.5 ± 0.8 | 72.4 ± 0.9 | 74.0 ± 0.4 | 77.7

Table A.1: Domain-specific performance on the VLCS dataset using training-domain validation (top) and oracle validation, denoted with * (bottom). We use a ResNet-50 backbone, optimize with ADAM, and follow the hyperparameter distributions specified in DOMAINBED. Only RSC and our methods were added as part of this work; the other baselines are taken from DOMAINBED.
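The two model-selection criteria in the caption can be sketched as follows; the `runs` dictionaries and field names below are illustrative, not the thesis' actual experiment format:

```python
def select_model(runs, criterion):
    """Pick the run with the highest validation accuracy under a criterion.

    'training-domain' validates on held-out splits of the source domains;
    'oracle' peeks at the target domain's validation split and therefore
    gives an optimistic upper bound (the *-rows in Tables A.1-A.4).
    """
    key = 'train_val_acc' if criterion == 'training-domain' else 'target_val_acc'
    return max(runs, key=lambda r: r[key])

runs = [
    {'id': 0, 'train_val_acc': 0.84, 'target_val_acc': 0.71},
    {'id': 1, 'train_val_acc': 0.82, 'target_val_acc': 0.76},
]
print(select_model(runs, 'training-domain')['id'])  # 0
print(select_model(runs, 'oracle')['id'])           # 1
```

Because the two criteria can pick different runs, the top and bottom halves of each table are not directly comparable.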

Algorithm | A | C | P | S | Avg.
ERM | 84.7 ± 0.4 | 80.8 ± 0.6 | 97.2 ± 0.3 | 79.3 ± 1.0 | 85.5
IRM | 84.8 ± 1.3 | 76.4 ± 1.1 | 96.7 ± 0.6 | 76.1 ± 1.0 | 83.5
GroupDRO | 83.5 ± 0.9 | 79.1 ± 0.6 | 96.7 ± 0.3 | 78.3 ± 2.0 | 84.4
Mixup | 86.1 ± 0.5 | 78.9 ± 0.8 | 97.6 ± 0.1 | 75.8 ± 1.8 | 84.6
MLDG | 85.5 ± 1.4 | 80.1 ± 1.7 | 97.4 ± 0.3 | 76.6 ± 1.1 | 84.9
CORAL | 88.3 ± 0.2 | 80.0 ± 0.5 | 97.5 ± 0.3 | 78.8 ± 1.3 | 86.2
MMD | 86.1 ± 1.4 | 79.4 ± 0.9 | 96.6 ± 0.2 | 76.5 ± 0.5 | 84.6
DANN | 86.4 ± 0.8 | 77.4 ± 0.8 | 97.3 ± 0.4 | 73.5 ± 2.3 | 83.6
CDANN | 84.6 ± 1.8 | 75.5 ± 0.9 | 96.8 ± 0.3 | 73.5 ± 0.6 | 82.6
MTL | 87.5 ± 0.8 | 77.1 ± 0.5 | 96.4 ± 0.8 | 77.3 ± 1.8 | 84.6
SagNet | 87.4 ± 1.0 | 80.7 ± 0.6 | 97.1 ± 0.1 | 80.0 ± 0.4 | 86.3
ARM | 86.8 ± 0.6 | 76.8 ± 0.5 | 97.4 ± 0.3 | 79.3 ± 1.2 | 85.1
VREx | 86.0 ± 1.6 | 79.1 ± 0.6 | 96.9 ± 0.5 | 77.7 ± 1.7 | 84.9
RSC | 85.4 ± 0.8 | 79.7 ± 1.8 | 97.6 ± 0.3 | 78.2 ± 1.2 | 85.2
DivCAM-S | 86.2 ± 1.4 | 79.1 ± 2.2 | 97.3 ± 0.4 | 79.2 ± 0.1 | 85.4
D-TRANSFORMERS | 86.9 ± 0.8 | 78.2 ± 1.7 | 96.6 ± 0.7 | 75.1 ± 0.5 | 84.2
ERM* | 86.5 ± 1.0 | 81.3 ± 0.6 | 96.2 ± 0.3 | 82.7 ± 1.1 | 86.7
IRM* | 84.2 ± 0.9 | 79.7 ± 1.5 | 95.9 ± 0.4 | 78.3 ± 2.1 | 84.5
GroupDRO* | 87.5 ± 0.5 | 82.9 ± 0.6 | 97.1 ± 0.3 | 81.1 ± 1.2 | 87.1
Mixup* | 87.5 ± 0.4 | 81.6 ± 0.7 | 97.4 ± 0.2 | 80.8 ± 0.9 | 86.8
MLDG* | 87.0 ± 1.2 | 82.5 ± 0.9 | 96.7 ± 0.3 | 81.2 ± 0.6 | 86.8
CORAL* | 86.6 ± 0.8 | 81.8 ± 0.9 | 97.1 ± 0.5 | 82.7 ± 0.6 | 87.1
MMD* | 88.1 ± 0.8 | 82.6 ± 0.7 | 97.1 ± 0.5 | 81.2 ± 1.2 | 87.2
DANN* | 87.0 ± 0.4 | 80.3 ± 0.6 | 96.8 ± 0.3 | 76.9 ± 1.1 | 85.2
CDANN* | 87.7 ± 0.6 | 80.7 ± 1.2 | 97.3 ± 0.4 | 77.6 ± 1.5 | 85.8
MTL* | 87.0 ± 0.2 | 82.7 ± 0.8 | 96.5 ± 0.7 | 80.5 ± 0.8 | 86.7
SagNet* | 87.4 ± 0.5 | 81.2 ± 1.2 | 96.3 ± 0.8 | 80.7 ± 1.1 | 86.4
ARM* | 85.0 ± 1.2 | 81.4 ± 0.2 | 95.9 ± 0.3 | 80.9 ± 0.5 | 85.8
VREx* | 87.8 ± 1.2 | 81.8 ± 0.7 | 97.4 ± 0.2 | 82.1 ± 0.7 | 87.2
RSC* | 86.0 ± 0.7 | 81.8 ± 0.9 | 96.8 ± 0.7 | 80.4 ± 0.5 | 86.2
DivCAM-S* | 86.5 ± 0.4 | 83.0 ± 0.5 | 97.2 ± 0.3 | 82.2 ± 0.1 | 87.2
D-TRANSFORMERS* | 87.8 ± 0.6 | 81.6 ± 0.3 | 97.2 ± 0.5 | 80.9 ± 0.5 | 86.9

Table A.2: Domain-specific performance on the PACS dataset using training-domain validation (top) and oracle validation, denoted with * (bottom). We use a ResNet-50 backbone, optimize with ADAM, and follow the hyperparameter distributions specified in DOMAINBED. Only RSC and our methods were added as part of this work; the other baselines are taken from DOMAINBED.

Algorithm | A | C | P | R | Avg.
ERM | 61.3 ± 0.7 | 52.4 ± 0.3 | 75.8 ± 0.1 | 76.6 ± 0.3 | 66.5
IRM | 58.9 ± 2.3 | 52.2 ± 1.6 | 72.1 ± 2.9 | 74.0 ± 2.5 | 64.3
GroupDRO | 60.4 ± 0.7 | 52.7 ± 1.0 | 75.0 ± 0.7 | 76.0 ± 0.7 | 66.0
Mixup | 62.4 ± 0.8 | 54.8 ± 0.6 | 76.9 ± 0.3 | 78.3 ± 0.2 | 68.1
MLDG | 61.5 ± 0.9 | 53.2 ± 0.6 | 75.0 ± 1.2 | 77.5 ± 0.4 | 66.8
CORAL | 65.3 ± 0.4 | 54.4 ± 0.5 | 76.5 ± 0.1 | 78.4 ± 0.5 | 68.7
MMD | 60.4 ± 0.2 | 53.3 ± 0.3 | 74.3 ± 0.1 | 77.4 ± 0.6 | 66.3
DANN | 59.9 ± 1.3 | 53.0 ± 0.3 | 73.6 ± 0.7 | 76.9 ± 0.5 | 65.9
CDANN | 61.5 ± 1.4 | 50.4 ± 2.4 | 74.4 ± 0.9 | 76.6 ± 0.8 | 65.8
MTL | 61.5 ± 0.7 | 52.4 ± 0.6 | 74.9 ± 0.4 | 76.8 ± 0.4 | 66.4
SagNet | 63.4 ± 0.2 | 54.8 ± 0.4 | 75.8 ± 0.4 | 78.3 ± 0.3 | 68.1
ARM | 58.9 ± 0.8 | 51.0 ± 0.5 | 74.1 ± 0.1 | 75.2 ± 0.3 | 64.8
VREx | 60.7 ± 0.9 | 53.0 ± 0.9 | 75.3 ± 0.1 | 76.6 ± 0.5 | 66.4
RSC | 60.7 ± 1.4 | 51.4 ± 0.3 | 74.8 ± 1.1 | 75.1 ± 1.3 | 65.5
DivCAM-S | 59.5 ± 0.3 | 49.7 ± 0.1 | 75.4 ± 0.7 | 76.2 ± 0.2 | 65.2
ERM* | 61.7 ± 0.7 | 53.4 ± 0.3 | 74.1 ± 0.4 | 76.2 ± 0.6 | 66.4
IRM* | 56.4 ± 3.2 | 51.2 ± 2.3 | 71.7 ± 2.7 | 72.7 ± 2.7 | 63.0
GroupDRO* | 60.5 ± 1.6 | 53.1 ± 0.3 | 75.5 ± 0.3 | 75.9 ± 0.7 | 66.2
Mixup* | 63.5 ± 0.2 | 54.6 ± 0.4 | 76.0 ± 0.3 | 78.0 ± 0.7 | 68.0
MLDG* | 60.5 ± 0.7 | 54.2 ± 0.5 | 75.0 ± 0.2 | 76.7 ± 0.5 | 66.6
CORAL* | 64.8 ± 0.8 | 54.1 ± 0.9 | 76.5 ± 0.4 | 78.2 ± 0.4 | 68.4
MMD* | 60.4 ± 1.0 | 53.4 ± 0.5 | 74.9 ± 0.1 | 76.1 ± 0.7 | 66.2
DANN* | 60.6 ± 1.4 | 51.8 ± 0.7 | 73.4 ± 0.5 | 75.5 ± 0.9 | 65.3
CDANN* | 57.9 ± 0.2 | 52.1 ± 1.2 | 74.9 ± 0.7 | 76.2 ± 0.2 | 65.3
MTL* | 60.7 ± 0.8 | 53.5 ± 1.3 | 75.2 ± 0.6 | 76.6 ± 0.6 | 66.5
SagNet* | 62.7 ± 0.5 | 53.6 ± 0.5 | 76.0 ± 0.3 | 77.8 ± 0.1 | 67.5
ARM* | 58.8 ± 0.5 | 51.8 ± 0.7 | 74.0 ± 0.1 | 74.4 ± 0.2 | 64.8
VREx* | 59.6 ± 1.0 | 53.3 ± 0.3 | 73.2 ± 0.5 | 76.6 ± 0.4 | 65.7
RSC* | 61.7 ± 0.8 | 53.0 ± 0.9 | 74.8 ± 0.8 | 76.3 ± 0.5 | 66.5
DivCAM-S* | 58.4 ± 1.1 | 52.7 ± 0.7 | 74.3 ± 0.5 | 75.2 ± 0.2 | 65.2

Table A.3: Domain-specific performance on the Office-Home dataset using training-domain validation (top) and oracle validation, denoted with * (bottom). We use a ResNet-50 backbone, optimize with ADAM, and follow the hyperparameter distributions specified in DOMAINBED. Only RSC and our methods were added as part of this work; the other baselines are taken from DOMAINBED.

Algorithm | L100 | L38 | L43 | L46 | Avg.
ERM | 49.8 ± 4.4 | 42.1 ± 1.4 | 56.9 ± 1.8 | 35.7 ± 3.9 | 46.1
IRM | 54.6 ± 1.3 | 39.8 ± 1.9 | 56.2 ± 1.8 | 39.6 ± 0.8 | 47.6
GroupDRO | 41.2 ± 0.7 | 38.6 ± 2.1 | 56.7 ± 0.9 | 36.4 ± 2.1 | 43.2
Mixup | 59.6 ± 2.0 | 42.2 ± 1.4 | 55.9 ± 0.8 | 33.9 ± 1.4 | 47.9
MLDG | 54.2 ± 3.0 | 44.3 ± 1.1 | 55.6 ± 0.3 | 36.9 ± 2.2 | 47.7
CORAL | 51.6 ± 2.4 | 42.2 ± 1.0 | 57.0 ± 1.0 | 39.8 ± 2.9 | 47.6
MMD | 41.9 ± 3.0 | 34.8 ± 1.0 | 57.0 ± 1.9 | 35.2 ± 1.8 | 42.2
DANN | 51.1 ± 3.5 | 40.6 ± 0.6 | 57.4 ± 0.5 | 37.7 ± 1.8 | 46.7
CDANN | 47.0 ± 1.9 | 41.3 ± 4.8 | 54.9 ± 1.7 | 39.8 ± 2.3 | 45.8
MTL | 49.3 ± 1.2 | 39.6 ± 6.3 | 55.6 ± 1.1 | 37.8 ± 0.8 | 45.6
SagNet | 53.0 ± 2.9 | 43.0 ± 2.5 | 57.9 ± 0.6 | 40.4 ± 1.3 | 48.6
ARM | 49.3 ± 0.7 | 38.3 ± 2.4 | 55.8 ± 0.8 | 38.7 ± 1.3 | 45.5
VREx | 48.2 ± 4.3 | 41.7 ± 1.3 | 56.8 ± 0.8 | 38.7 ± 3.1 | 46.4
RSC | 50.2 ± 2.2 | 39.2 ± 1.4 | 56.3 ± 1.4 | 40.8 ± 0.6 | 46.6
DivCAM-S | 51.6 ± 2.2 | 44.4 ± 2.1 | 55.2 ± 1.7 | 40.7 ± 2.6 | 48.0
D-TRANSFORMERS | 53.7 ± 1.0 | 29.4 ± 2.5 | 53.9 ± 1.0 | 34.5 ± 3.1 | 42.9
ERM* | 59.4 ± 0.9 | 49.3 ± 0.6 | 60.1 ± 1.1 | 43.2 ± 0.5 | 53.0
IRM* | 56.5 ± 2.5 | 49.8 ± 1.5 | 57.1 ± 2.2 | 38.6 ± 1.0 | 50.5
GroupDRO* | 60.4 ± 1.5 | 48.3 ± 0.4 | 58.6 ± 0.8 | 42.2 ± 0.8 | 52.4
Mixup* | 67.6 ± 1.8 | 51.0 ± 1.3 | 59.0 ± 0.0 | 40.0 ± 1.1 | 54.4
MLDG* | 59.2 ± 0.1 | 49.0 ± 0.9 | 58.4 ± 0.9 | 41.4 ± 1.0 | 52.0
CORAL* | 60.4 ± 0.9 | 47.2 ± 0.5 | 59.3 ± 0.4 | 44.4 ± 0.4 | 52.8
MMD* | 60.6 ± 1.1 | 45.9 ± 0.3 | 57.8 ± 0.5 | 43.8 ± 1.2 | 52.0
DANN* | 55.2 ± 1.9 | 47.0 ± 0.7 | 57.2 ± 0.9 | 42.9 ± 0.9 | 50.6
CDANN* | 56.3 ± 2.0 | 47.1 ± 0.9 | 57.2 ± 1.1 | 42.4 ± 0.8 | 50.8
MTL* | 58.4 ± 2.1 | 48.4 ± 0.8 | 58.9 ± 0.6 | 43.0 ± 1.3 | 52.2
SagNet* | 56.4 ± 1.9 | 50.5 ± 2.3 | 59.1 ± 0.5 | 44.1 ± 0.6 | 52.5
ARM* | 60.1 ± 1.5 | 48.3 ± 1.6 | 55.3 ± 0.6 | 40.9 ± 1.1 | 51.2
VREx* | 56.8 ± 1.7 | 46.5 ± 0.5 | 58.4 ± 0.3 | 43.8 ± 0.3 | 51.4
RSC* | 59.9 ± 1.4 | 46.7 ± 0.4 | 57.8 ± 0.5 | 44.3 ± 0.6 | 52.1
DivCAM-S* | 57.7 ± 1.1 | 46.0 ± 1.3 | 58.9 ± 0.4 | 42.5 ± 0.7 | 51.3
D-TRANSFORMERS* | 59.7 ± 0.6 | 51.1 ± 1.4 | 56.5 ± 0.4 | 42.2 ± 1.0 | 52.4
算法L100L38L43L46\( \mathbf{{Avg}.} \)
经验风险最小化(ERM)\( {49.8} \pm {4.4} \)\( {42.1} \pm {1.4} \)\( {56.9} \pm {1.8} \)\( {35.7} \pm {3.9} \)46.1
不变风险最小化(IRM)\( {54.6} \pm {1.3} \)\( {39.8} \pm {1.9} \)\( {56.2} \pm {1.8} \)\( {39.6} \pm {0.8} \)47.6
群体分布鲁棒优化(GroupDRO)\( {41.2} \pm {0.7} \)\( {38.6} \pm {2.1} \)\( {56.7} \pm {0.9} \)\( {36.4} \pm {2.1} \)43.2
混合增强(Mixup)\( {59.6} \pm {2.0} \)\( {42.2} \pm {1.4} \)\( {55.9} \pm {0.8} \)\( {33.9} \pm {1.4} \)47.9
元学习领域泛化(MLDG)\( {54.2} \pm {3.0} \)\( {44.3} \pm {1.1} \)\( {55.6} \pm {0.3} \)\( {36.9} \pm {2.2} \)47.7
相关对齐(CORAL)\( {51.6} \pm {2.4} \)\( {42.2} \pm {1.0} \)\( {57.0} \pm {1.0} \)\( {39.8} \pm {2.9} \)47.6
最大均值差异(MMD)\( {41.9} \pm {3.0} \)\( {34.8} \pm {1.0} \)\( {57.0} \pm {1.9} \)\( {35.2} \pm {1.8} \)42.2
领域对抗神经网络(DANN)\( {51.1} \pm {3.5} \)\( {40.6} \pm {0.6} \)\( {57.4} \pm {0.5} \)\( {37.7} \pm {1.8} \)46.7
条件领域对抗神经网络(CDANN)\( {47.0} \pm {1.9} \)\( {41.3} \pm {4.8} \)\( {54.9} \pm {1.7} \)\( {39.8} \pm {2.3} \)45.8
多任务学习(MTL)\( {49.3} \pm {1.2} \)\( {39.6} \pm {6.3} \)\( {55.6} \pm {1.1} \)\( {37.8} \pm {0.8} \)45.6
语义-属性网络(SagNet)\( {53.0} \pm {2.9} \)\( {43.0} \pm {2.5} \)\( {57.9} \pm {0.6} \)\( {40.4} \pm {1.3} \)48.6
自适应风险最小化(ARM)\( {49.3} \pm {0.7} \)\( {38.3} \pm {2.4} \)\( {55.8} \pm {0.8} \)\( {38.7} \pm {1.3} \)45.5
方差风险扩展(VREx)\( {48.2} \pm {4.3} \)\( {41.7} \pm {1.3} \)\( {56.8} \pm {0.8} \)\( {38.7} \pm {3.1} \)46.4
表示自适应剪枝(RSC)\( {50.2} \pm {2.2} \)\( {39.2} \pm {1.4} \)\( {56.3} \pm {1.4} \)\( {40.8} \pm {0.6} \)46.6
多样性类激活映射-小型(DivCAM-S)\( {51.6} \pm {2.2} \)\( {44.4} \pm {2.1} \)\( {55.2} \pm {1.7} \)\( {40.7} \pm {2.6} \)48.0
D-变换器(D-TRANSFORMERS)\( {53.7} \pm {1.0} \)\( {29.4} \pm {2.5} \)\( {53.9} \pm {1.0} \)\( {34.5} \pm {3.1} \)42.9
经验风险最小化(ERM)*\( {59.4} \pm {0.9} \)\( {49.3} \pm {0.6} \)\( {60.1} \pm {1.1} \)\( {43.2} \pm {0.5} \)53.0
不变风险最小化(IRM)*\( {56.5} \pm {2.5} \)\( {49.8} \pm {1.5} \)\( {57.1} \pm {2.2} \)\( {38.6} \pm {1.0} \)50.5
群体分布鲁棒优化(GroupDRO)*\( {60.4} \pm {1.5} \)\( {48.3} \pm {0.4} \)\( {58.6} \pm {0.8} \)\( {42.2} \pm {0.8} \)52.4
混合增强(Mixup)*\( {67.6} \pm {1.8} \)\( {51.0} \pm {1.3} \)\( {59.0} \pm {0.0} \)\( {40.0} \pm {1.1} \)54.4
元学习领域泛化(MLDG)*\( {59.2} \pm {0.1} \)\( {49.0} \pm {0.9} \)\( {58.4} \pm {0.9} \)\( {41.4} \pm {1.0} \)52.0
相关对齐(CORAL)*\( {60.4} \pm {0.9} \)\( {47.2} \pm {0.5} \)\( {59.3} \pm {0.4} \)\( {44.4} \pm {0.4} \)52.8
最大均值差异(MMD)*\( {60.6} \pm {1.1} \)\( {45.9} \pm {0.3} \)\( {57.8} \pm {0.5} \)\( {43.8} \pm {1.2} \)52.0
领域对抗神经网络(DANN)*\( {55.2} \pm {1.9} \)\( {47.0} \pm {0.7} \)\( {57.2} \pm {0.9} \)\( {42.9} \pm {0.9} \)50.6
条件领域对抗神经网络(CDANN)*\( {56.3} \pm {2.0} \)\( {47.1} \pm {0.9} \)\( {57.2} \pm {1.1} \)\( {42.4} \pm {0.8} \)50.8
多任务学习(MTL)*\( {58.4} \pm {2.1} \)\( {48.4} \pm {0.8} \)\( {58.9} \pm {0.6} \)\( {43.0} \pm {1.3} \)52.2
语义-属性网络(SagNet)*\( {56.4} \pm {1.9} \)\( {50.5} \pm {2.3} \)\( {59.1} \pm {0.5} \)\( {44.1} \pm {0.6} \)52.5
自适应风险最小化(ARM)*\( {60.1} \pm {1.5} \)\( {48.3} \pm {1.6} \)\( {55.3} \pm {0.6} \)\( {40.9} \pm {1.1} \)51.2
方差风险扩展(VREx)*\( {56.8} \pm {1.7} \)\( {46.5} \pm {0.5} \)\( {58.4} \pm {0.3} \)\( {43.8} \pm {0.3} \)51.4
表示自适应剪枝(RSC)*\( {59.9} \pm {1.4} \)\( {46.7} \pm {0.4} \)\( {57.8} \pm {0.5} \)\( {44.3} \pm {0.6} \)52.1
多样性类激活映射-小型(DivCAM-S)*\( {57.7} \pm {1.1} \)\( {46.0} \pm {1.3} \)\( {58.9} \pm {0.4} \)\( {42.5} \pm {0.7} \)51.3
D-变换器(D-TRANSFORMERS)*\( {59.7} \pm {0.6} \)\( {51.1} \pm {1.4} \)\( {56.5} \pm {0.4} \)\( {42.2} \pm {1.0} \)52.4

Table A.4: Domain-specific performance for the Terra Incognita dataset using training-domain validation (top) and oracle validation denoted with * (bottom). We use a ResNet-50 backbone, optimize with ADAM, and follow the hyperparameter distributions specified in DOMAINBED. Only RSC and our methods have been added as part of this work; the other baselines are taken from DOMAINBED.
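The Avg. column in these tables is simply the mean of the per-domain test accuracies in that row. A minimal sketch, using the ERM* row of Table A.4 as a sanity check (the function name is illustrative, not the thesis code):

```python
def average_accuracy(domain_accs):
    """Average of per-domain test accuracies, as in the Avg. column."""
    return round(sum(domain_accs) / len(domain_accs), 1)

# ERM* row of Table A.4: L100, L38, L43, L46
print(average_accuracy([59.4, 49.3, 60.1, 43.2]))  # → 53.0
```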

Algorithm | clip | info | paint | quick | real | sketch | Avg.
ERM | 58.1 ± 0.3 | 18.8 ± 0.3 | 46.7 ± 0.3 | 12.2 ± 0.4 | 59.6 ± 0.1 | 49.8 ± 0.4 | 40.9
IRM | 48.5 ± 2.8 | 15.0 ± 1.5 | 38.3 ± 4.3 | 10.9 ± 0.5 | 48.2 ± 5.2 | 42.3 ± 3.1 | 33.9
GroupDRO | 47.2 ± 0.5 | 17.5 ± 0.4 | 33.8 ± 0.5 | 9.3 ± 0.3 | 51.6 ± 0.4 | 40.1 ± 0.6 | 33.3
Mixup | 55.7 ± 0.3 | 18.5 ± 0.5 | 44.3 ± 0.5 | 12.5 ± 0.4 | 55.8 ± 0.3 | 48.2 ± 0.5 | 39.2
MLDG | 59.1 ± 0.2 | 19.1 ± 0.3 | 45.8 ± 0.7 | 13.4 ± 0.3 | 59.6 ± 0.2 | 50.2 ± 0.4 | 41.2
CORAL | 59.2 ± 0.1 | 19.7 ± 0.2 | 46.6 ± 0.3 | 13.4 ± 0.4 | 59.8 ± 0.2 | 50.1 ± 0.6 | 41.5
MMD | 32.1 ± 13.3 | 11.0 ± 4.6 | 26.8 ± 11.3 | 8.7 ± 2.1 | 32.7 ± 13.8 | 28.9 ± 11.9 | 23.4
DANN | 53.1 ± 0.2 | 18.3 ± 0.1 | 44.2 ± 0.7 | 11.8 ± 0.1 | 55.5 ± 0.4 | 46.8 ± 0.6 | 38.3
CDANN | 54.6 ± 0.4 | 17.3 ± 0.1 | 43.7 ± 0.9 | 12.1 ± 0.7 | 56.2 ± 0.4 | 45.9 ± 0.5 | 38.3
MTL | 57.9 ± 0.5 | 18.5 ± 0.4 | 46.0 ± 0.1 | 12.5 ± 0.1 | 59.5 ± 0.3 | 49.2 ± 0.1 | 40.6
SagNet | 57.7 ± 0.3 | 19.0 ± 0.2 | 45.3 ± 0.3 | 12.7 ± 0.5 | 58.1 ± 0.5 | 48.8 ± 0.2 | 40.3
ARM | 49.7 ± 0.3 | 16.3 ± 0.5 | 40.9 ± 1.1 | 9.4 ± 0.1 | 53.4 ± 0.4 | 43.5 ± 0.4 | 35.5
VREx | 47.3 ± 3.5 | 16.0 ± 1.5 | 35.8 ± 4.6 | 10.9 ± 0.3 | 49.6 ± 4.9 | 42.0 ± 3.0 | 33.6
RSC | 55.0 ± 1.2 | 18.3 ± 0.5 | 44.4 ± 0.6 | 12.2 ± 0.2 | 55.7 ± 0.7 | 47.8 ± 0.9 | 38.9
DivCAM-S | 57.7 ± 0.3 | 19.3 ± 0.3 | 46.8 ± 0.2 | 12.7 ± 0.4 | 58.9 ± 0.2 | 48.5 ± 0.4 | 40.7
ERM* | 58.6 ± 0.3 | 19.2 ± 0.2 | 47.0 ± 0.3 | 13.2 ± 0.2 | 59.9 ± 0.3 | 49.8 ± 0.4 | 41.3
IRM* | 40.4 ± 6.6 | 12.1 ± 2.7 | 31.4 ± 5.7 | 9.8 ± 1.2 | 37.7 ± 9.0 | 36.7 ± 5.3 | 28.0
GroupDRO* | 47.2 ± 0.5 | 17.5 ± 0.4 | 34.2 ± 0.3 | 9.2 ± 0.4 | 51.9 ± 0.5 | 40.1 ± 0.6 | 33.4
Mixup* | 55.6 ± 0.1 | 18.7 ± 0.4 | 45.1 ± 0.5 | 12.8 ± 0.3 | 57.6 ± 0.5 | 48.2 ± 0.4 | 39.6
MLDG* | 59.3 ± 0.1 | 19.6 ± 0.2 | 46.8 ± 0.2 | 13.4 ± 0.2 | 60.1 ± 0.4 | 50.4 ± 0.3 | 41.6
CORAL* | 59.2 ± 0.1 | 19.9 ± 0.2 | 47.4 ± 0.2 | 14.0 ± 0.4 | 59.8 ± 0.2 | 50.4 ± 0.4 | 41.8
MMD* | 32.2 ± 13.3 | 11.2 ± 4.5 | 26.8 ± 11.3 | 8.8 ± 2.2 | 32.7 ± 13.8 | 29.0 ± 11.8 | 23.5
DANN* | 53.1 ± 0.2 | 18.3 ± 0.1 | 44.2 ± 0.7 | 11.9 ± 0.1 | 55.5 ± 0.4 | 46.8 ± 0.6 | 38.3
CDANN* | 54.6 ± 0.4 | 17.3 ± 0.1 | 44.2 ± 0.7 | 12.8 ± 0.2 | 56.2 ± 0.4 | 45.9 ± 0.5 | 38.5
MTL* | 58.0 ± 0.4 | 19.2 ± 0.2 | 46.2 ± 0.1 | 12.7 ± 0.2 | 59.9 ± 0.1 | 49.0 ± 0.0 | 40.8
SagNet* | 57.7 ± 0.3 | 19.1 ± 0.1 | 46.3 ± 0.5 | 13.5 ± 0.4 | 58.9 ± 0.4 | 49.5 ± 0.2 | 40.8
ARM* | 49.6 ± 0.4 | 16.5 ± 0.3 | 41.5 ± 0.8 | 10.8 ± 0.1 | 53.5 ± 0.3 | 43.9 ± 0.4 | 36.0
VREx* | 43.3 ± 4.5 | 14.1 ± 1.8 | 32.5 ± 5.0 | 9.8 ± 1.1 | 43.5 ± 5.6 | 37.7 ± 4.5 | 30.1
RSC* | 55.0 ± 1.2 | 18.3 ± 0.5 | 44.4 ± 0.6 | 12.5 ± 0.1 | 55.7 ± 0.7 | 47.8 ± 0.9 | 38.9
DivCAM-S* | 57.8 ± 0.2 | 19.3 ± 0.3 | 47.0 ± 0.1 | 13.1 ± 0.3 | 59.6 ± 0.1 | 48.9 ± 0.2 | 41.0

Table A.5: Domain-specific performance for the DomainNet dataset using training-domain validation (top) and oracle validation denoted with * (bottom). We use a ResNet-50 backbone, optimize with ADAM, and follow the hyperparameter distributions specified in DOMAINBED. Only RSC and our methods have been added as part of this work; the other baselines are taken from DOMAINBED.
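The two halves of these tables differ only in how one run is selected from the hyperparameter sweep: training-domain validation picks the run with the best accuracy on held-out splits of the source domains, while oracle validation (starred rows) peeks at target-domain accuracy and therefore only gives an upper bound. A minimal sketch of the two criteria, with illustrative field names rather than the DOMAINBED code:

```python
def select_by_validation(runs, oracle=False):
    """Pick the target-domain accuracy of one run from a sweep.

    Each run is a dict with:
      'val_acc'  - accuracy on held-out splits of the training domains
      'test_acc' - accuracy on the unseen target domain
    """
    key = "test_acc" if oracle else "val_acc"
    best = max(runs, key=lambda r: r[key])
    return best["test_acc"]

runs = [
    {"val_acc": 80.1, "test_acc": 38.9},
    {"val_acc": 84.3, "test_acc": 41.0},  # best training-domain validation
    {"val_acc": 79.5, "test_acc": 43.5},  # best target-domain accuracy
]
print(select_by_validation(runs))               # → 41.0
print(select_by_validation(runs, oracle=True))  # → 43.5
```

The gap between the two returned values mirrors the gap between the unstarred and starred rows of the tables.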

Additional distance plots


Figure B.1: Pairwise learned prototype \( \ell_2 \)-distance (top) and cosine-distance \( \varrho \) (bottom) of the best-performing model with negative weight \( w_{c,j} = -1.0 \;\forall j : p_j \notin P_c \) for each testing domain. Red squares denote prototype class correspondence for the 7 different classes in the PACS dataset. No self-challenging is applied and colormap bounds are adjusted per metric for visualization purposes. First data split.
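The distance matrices shown in these figures can be reproduced from a set of learned prototype vectors; a minimal sketch, where the function name and array layout are illustrative and not the thesis code:

```python
import numpy as np

def prototype_distance_matrices(prototypes):
    """Pairwise l2- and cosine-distance matrices for n prototypes.

    prototypes: array of shape (n, d), one learned prototype per row.
    Returns (l2, cos), each of shape (n, n); cos = 1 - cosine similarity.
    """
    diff = prototypes[:, None, :] - prototypes[None, :, :]
    l2 = np.sqrt((diff ** 2).sum(axis=-1))
    unit = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    cos = 1.0 - unit @ unit.T
    return l2, cos

# Two orthogonal unit vectors: l2-distance sqrt(2), cosine distance 1.
l2, cos = prototype_distance_matrices(np.eye(2))
```

Plotting each matrix with a per-metric colormap range then yields figures of the kind shown here.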

Figure B.2: Pairwise learned prototype \( \ell_2 \)-distance (top) and cosine-distance \( \varrho \) (bottom) of the best-performing model with negative weight \( w_{c,j} = -1.0 \;\forall j : p_j \notin P_c \) for each testing domain. Red squares denote prototype class correspondence for the 7 different classes in the PACS dataset. No self-challenging is applied and colormap bounds are adjusted per metric for visualization purposes. Third data split.

Figure B.3: Pairwise learned prototype \( \ell_2 \)-distance (top) and cosine-distance \( \varrho \) (bottom) of the best-performing model with negative weight \( w_{c,j} = -1.0 \;\forall j : p_j \notin P_c \) for each testing domain. Red squares denote prototype class correspondence for the 7 different classes in the PACS dataset. Self-challenging is applied and colormap bounds are adjusted per metric for visualization purposes. First data split.

Figure B.4: Pairwise learned prototype \( \ell_2 \)-distance (top) and cosine-distance \( \varrho \) (bottom) of the best-performing model with negative weight \( w_{c,j} = -1.0 \;\forall j : p_j \notin P_c \) for each testing domain. Red squares denote prototype class correspondence for the 7 different classes in the PACS dataset. Self-challenging is applied and colormap bounds are adjusted per metric for visualization purposes. Third data split.

Figure B.5: Pairwise learned prototype \( \ell_2 \)-distance (top) and cosine-distance \( \varrho \) (bottom) of the best-performing model with negative weight \( w_{c,j} = 0.0 \;\forall j : p_j \notin P_c \) for each testing domain. Red squares denote prototype class correspondence for the 7 different classes in the PACS dataset. No self-challenging is applied and colormap bounds are adjusted per metric for visualization purposes. First data split.

Figure B.6: Pairwise learned prototype \( \ell_2 \)-distance (top) and cosine-distance \( \varrho \) (bottom) of the best-performing model with negative weight \( w_{c,j} = 0.0 \;\forall j : p_j \notin P_c \) for each testing domain. Red squares denote prototype class correspondence for the 7 different classes in the PACS dataset. No self-challenging is applied and colormap bounds are adjusted per metric for visualization purposes. Second data split.

Figure B.7: Pairwise learned prototype \( \ell_2 \)-distance (top) and cosine-distance \( \varrho \) (bottom) of the best-performing model with negative weight \( w_{c,j} = 0.0 \;\forall j : p_j \notin P_c \) for each testing domain. Red squares denote prototype class correspondence for the 7 different classes in the PACS dataset. No self-challenging is applied and colormap bounds are adjusted per metric for visualization purposes. Third data split.

Figure B.8: Pairwise learned prototype \( \ell_2 \)-distance (top) and cosine-distance \( \varrho \) (bottom) of the best-performing model with negative weight \( w_{c,j} = 0.0 \;\forall j : p_j \notin P_c \) for each testing domain. Red squares denote prototype class correspondence for the 7 different classes in the PACS dataset. Self-challenging is applied and colormap bounds are adjusted per metric for visualization purposes. First data split.

Figure B.9: Pairwise learned prototype \( \ell_2 \)-distance (top) and cosine-distance \( \varrho \) (bottom) of the best-performing model with negative weight \( w_{c,j} = 0.0 \;\forall j : p_j \notin P_c \) for each testing domain. Red squares denote prototype class correspondence for the 7 different classes in the PACS dataset. Self-challenging is applied and colormap bounds are adjusted per metric for visualization purposes. Second data split.

Figure B.10: Pairwise learned prototype \( \ell_2 \)-distance (top) and cosine-distance \( \varrho \) (bottom) of the best-performing model with negative weight \( w_{c,j} = 0.0 \;\forall j : p_j \notin P_c \) for each testing domain. Red squares denote prototype class correspondence for the 7 different classes in the PACS dataset. Self-challenging is applied and colormap bounds are adjusted per metric for visualization purposes. Third data split.